Nacos Service Heartbeat
创始人
2024-03-27 19:33:31

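Nacos instances come in two kinds, ephemeral and persistent, and each kind has its own health-check mechanism.
Ephemeral instances are health-checked through client heartbeats, while persistent instances are actively probed by the Nacos server.
The instance type is set in the application's YAML configuration: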

    spring:
      application:
        name: order-service
      cloud:
        nacos:
          discovery:
            ephemeral: false # Make the instance persistent. true: ephemeral; false: persistent
            server-addr: 192.168.150.1:8845

The heartbeat API endpoint exposed by Nacos is: /nacos/v1/ns/instance/beat

Client

The NacosNamingService class, which implements the NamingService interface, starts the heartbeat when an instance is registered:

    @Override
    public void registerInstance(String serviceName, String groupName, Instance instance) throws NacosException {
        if (instance.isEphemeral()) {
            BeatInfo beatInfo = new BeatInfo();
            beatInfo.setServiceName(NamingUtils.getGroupedName(serviceName, groupName));
            beatInfo.setIp(instance.getIp());
            beatInfo.setPort(instance.getPort());
            beatInfo.setCluster(instance.getClusterName());
            beatInfo.setWeight(instance.getWeight());
            beatInfo.setMetadata(instance.getMetadata());
            beatInfo.setScheduled(false);
            beatInfo.setPeriod(instance.getInstanceHeartBeatInterval());
            // Schedule heartbeats to the Nacos server (ephemeral instances only)
            beatReactor.addBeatInfo(NamingUtils.getGroupedName(serviceName, groupName), beatInfo);
        }
        serverProxy.registerService(NamingUtils.getGroupedName(serviceName, groupName), groupName, instance);
    }

BeatInfo
As the code above shows, BeatInfo simply carries the information a heartbeat needs:


    /**
     * @author nkorange
     */
    public class BeatInfo {
        private int port;
        private String ip;
        private double weight;
        private String serviceName;
        private String cluster;
        private Map<String, String> metadata;
        private volatile boolean scheduled;
        private volatile long period;
        private volatile boolean stopped;
    }

BeatReactor
This class maintains a scheduled thread pool:

    public BeatReactor(NamingProxy serverProxy, int threadCount) {
        this.serverProxy = serverProxy;
        executorService = new ScheduledThreadPoolExecutor(threadCount, new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread thread = new Thread(r);
                thread.setDaemon(true);
                thread.setName("com.alibaba.nacos.naming.beat.sender");
                return thread;
            }
        });
    }

Calling addBeatInfo schedules the heartbeat task:

    public void addBeatInfo(String serviceName, BeatInfo beatInfo) {
        NAMING_LOGGER.info("[BEAT] adding beat: {} to beat map.", beatInfo);
        String key = buildKey(serviceName, beatInfo.getIp(), beatInfo.getPort());
        BeatInfo existBeat = null;
        //fix #1733
        if ((existBeat = dom2Beat.remove(key)) != null) {
            existBeat.setStopped(true);
        }
        dom2Beat.put(key, beatInfo);
        // Use the thread pool to run the heartbeat task after beatInfo.getPeriod() milliseconds;
        // the task reschedules itself, so heartbeats repeat with that period.
        // The default period is defined in com.alibaba.nacos.api.common.Constants:
        // public static final long DEFAULT_HEART_BEAT_INTERVAL = TimeUnit.SECONDS.toMillis(5);
        // so by default a heartbeat is sent every 5 seconds.
        // BeatTask is a Runnable
        executorService.schedule(new BeatTask(beatInfo), beatInfo.getPeriod(), TimeUnit.MILLISECONDS);
        MetricsMonitor.getDom2BeatSizeMonitor().set(dom2Beat.size());
    }
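The replace-and-stop step in addBeatInfo can be isolated as a small sketch. It is not the Nacos implementation: the Beat class, the BeatRegistrySketch name, and the "#"-separated key layout are assumptions made for illustration. It shows why a task that is already scheduled for an old BeatInfo becomes a no-op: registering a new beat under the same key flags the old one as stopped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BeatRegistrySketch {
    /** Hypothetical stand-in for BeatInfo, reduced to what the example needs. */
    static final class Beat {
        final String ip;
        final int port;
        volatile boolean stopped;

        Beat(String ip, int port) {
            this.ip = ip;
            this.port = port;
        }
    }

    private final Map<String, Beat> beats = new ConcurrentHashMap<>();

    /** Assumed key layout; the real buildKey may differ. */
    static String buildKey(String serviceName, String ip, int port) {
        return serviceName + "#" + ip + "#" + port;
    }

    /** Registers a beat and returns the one it replaced, or null if there was none. */
    Beat add(String serviceName, Beat beat) {
        Beat previous = beats.put(buildKey(serviceName, beat.ip, beat.port), beat);
        if (previous != null) {
            // A task still holding the old beat sees this flag and returns early.
            previous.stopped = true;
        }
        return previous;
    }

    int size() {
        return beats.size();
    }
}
```

Adding two beats for the same service, IP, and port leaves one entry in the map, with the first beat marked as stopped.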

BeatTask
The heartbeat task is encapsulated in the BeatTask class, a Runnable. Its run method is as follows:

        public void run() {
            if (beatInfo.isStopped()) {
                return;
            }
            // Get the heartbeat period
            long nextTime = beatInfo.getPeriod();
            try {
                // Send the heartbeat
                JSONObject result = serverProxy.sendBeat(beatInfo, BeatReactor.this.lightBeatEnabled);
                long interval = result.getIntValue("clientBeatInterval");
                boolean lightBeatEnabled = false;
                if (result.containsKey(CommonParams.LIGHT_BEAT_ENABLED)) {
                    lightBeatEnabled = result.getBooleanValue(CommonParams.LIGHT_BEAT_ENABLED);
                }
                BeatReactor.this.lightBeatEnabled = lightBeatEnabled;
                if (interval > 0) {
                    nextTime = interval;
                }
                // Check the heartbeat result
                int code = NamingResponseCode.OK;
                if (result.containsKey(CommonParams.CODE)) {
                    code = result.getIntValue(CommonParams.CODE);
                }
                if (code == NamingResponseCode.RESOURCE_NOT_FOUND) {
                    // The server does not know this instance, so register it again
                    Instance instance = new Instance();
                    instance.setPort(beatInfo.getPort());
                    instance.setIp(beatInfo.getIp());
                    instance.setWeight(beatInfo.getWeight());
                    instance.setMetadata(beatInfo.getMetadata());
                    instance.setClusterName(beatInfo.getCluster());
                    instance.setServiceName(beatInfo.getServiceName());
                    instance.setInstanceId(instance.getInstanceId());
                    instance.setEphemeral(true);
                    try {
                        serverProxy.registerService(beatInfo.getServiceName(),
                            NamingUtils.getGroupName(beatInfo.getServiceName()), instance);
                    } catch (Exception ignore) {
                        // Swallow the exception and do nothing
                    }
                }
            } catch (NacosException ne) {
                NAMING_LOGGER.error("[CLIENT-BEAT] failed to send beat: {}, code: {}, msg: {}",
                    JSON.toJSONString(beatInfo), ne.getErrCode(), ne.getErrMsg());
            }
            // Schedule the next heartbeat, using the interval returned by the server if there was one
            executorService.schedule(new BeatTask(beatInfo), nextTime, TimeUnit.MILLISECONDS);
        }
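The scheduling pattern in run, a one-shot task that schedules its own successor, can be tried in isolation. The following is a minimal sketch under stated assumptions: SelfReschedulingBeat, runFor, and the counter are hypothetical names, and incrementing the counter stands in for the real sendBeat call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SelfReschedulingBeat {
    private final ScheduledExecutorService executor;
    private final long periodMillis;
    private final AtomicInteger sent = new AtomicInteger();
    private volatile boolean stopped;

    SelfReschedulingBeat(ScheduledExecutorService executor, long periodMillis) {
        this.executor = executor;
        this.periodMillis = periodMillis;
    }

    void start() {
        executor.schedule(this::beatOnce, periodMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        stopped = true;
    }

    int sentCount() {
        return sent.get();
    }

    private void beatOnce() {
        if (stopped) {
            return; // a stopped beat does not reschedule itself
        }
        try {
            sent.incrementAndGet(); // stand-in for sending the heartbeat
        } finally {
            // One-shot scheduling: each run queues the next one.
            executor.schedule(this::beatOnce, periodMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** Runs a beat for roughly durationMillis and returns how many beats were sent. */
    static int runFor(long periodMillis, long durationMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        SelfReschedulingBeat beat = new SelfReschedulingBeat(executor, periodMillis);
        beat.start();
        try {
            Thread.sleep(durationMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            beat.stop();
            executor.shutdownNow();
        }
        return beat.sentCount();
    }
}
```

Compared with scheduleAtFixedRate, rescheduling from inside the task lets every run pick its own next delay, which is how BeatTask can adopt the clientBeatInterval returned by the server.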

Sending the heartbeat

    public JSONObject sendBeat(BeatInfo beatInfo, boolean lightBeatEnabled) throws NacosException {
        if (NAMING_LOGGER.isDebugEnabled()) {
            NAMING_LOGGER.debug("[BEAT] {} sending beat to server: {}", namespaceId, beatInfo.toString());
        }
        // Assemble the request parameters
        Map<String, String> params = new HashMap<String, String>(8);
        String body = StringUtils.EMPTY;
        if (!lightBeatEnabled) {
            body = "beat=" + JSON.toJSONString(beatInfo);
        }
        params.put(CommonParams.NAMESPACE_ID, namespaceId);
        params.put(CommonParams.SERVICE_NAME, beatInfo.getServiceName());
        params.put(CommonParams.CLUSTER_NAME, beatInfo.getCluster());
        params.put("ip", beatInfo.getIp());
        params.put("port", String.valueOf(beatInfo.getPort()));
        // Send the request; the path is the heartbeat endpoint /nacos/v1/ns/instance/beat
        String result = reqAPI(UtilAndComs.NACOS_URL_BASE + "/instance/beat", params, body, HttpMethod.PUT);
        return JSON.parseObject(result);
    }
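To make the request shape concrete, here is a sketch that only builds the URL of such a heartbeat request; it sends nothing. The class and method names are hypothetical, and the parameter names "namespaceId", "serviceName", and "clusterName" are assumed values of the CommonParams constants used above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class BeatRequestSketch {
    static final String BEAT_PATH = "/nacos/v1/ns/instance/beat";

    /** Builds the URL a light heartbeat (no "beat" body) would be PUT to. */
    static String buildBeatUrl(String serverAddr, String namespaceId, String serviceName,
                               String clusterName, String ip, int port) {
        // Parameter names are assumptions about the CommonParams constants.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("namespaceId", namespaceId);
        params.put("serviceName", serviceName);
        params.put("clusterName", clusterName);
        params.put("ip", ip);
        params.put("port", String.valueOf(port));

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return "http://" + serverAddr + BEAT_PATH + "?" + query;
    }
}
```

The service name passed in would be the grouped name, for example DEFAULT_GROUP@@order-service, which is why the value is URL-encoded.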

Server

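For ephemeral instances, the server-side code has two parts:
(1) InstanceController exposes an endpoint that handles heartbeat requests from clients
(2) A scheduled task checks whether each instance's heartbeats arrive on time
Following the endpoint the client calls, we find the method in InstanceController that handles heartbeat requests: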

    @CanDistro
    @PutMapping("/beat")
    @Secured(parser = NamingResourceParser.class, action = ActionTypes.WRITE)
    public JSONObject beat(HttpServletRequest request) throws Exception {
        JSONObject result = new JSONObject();
        result.put("clientBeatInterval", switchDomain.getClientBeatInterval());
        // Parse the heartbeat request parameters
        // serviceName
        String serviceName = WebUtils.required(request, CommonParams.SERVICE_NAME);
        // namespaceId
        String namespaceId = WebUtils.optional(request, CommonParams.NAMESPACE_ID,
            Constants.DEFAULT_NAMESPACE_ID);
        // clusterName
        String clusterName = WebUtils.optional(request, CommonParams.CLUSTER_NAME,
            UtilsAndCommons.DEFAULT_CLUSTER_NAME);
        // ip
        String ip = WebUtils.optional(request, "ip", StringUtils.EMPTY);
        // port
        int port = Integer.parseInt(WebUtils.optional(request, "port", "0"));
        String beat = WebUtils.optional(request, "beat", StringUtils.EMPTY);
        RsInfo clientBeat = null;
        if (StringUtils.isNotBlank(beat)) {
            clientBeat = JSON.parseObject(beat, RsInfo.class);
        }
        if (clientBeat != null) {
            if (StringUtils.isNotBlank(clientBeat.getCluster())) {
                clusterName = clientBeat.getCluster();
            }
            ip = clientBeat.getIp();
            port = clientBeat.getPort();
        }
        if (Loggers.SRV_LOG.isDebugEnabled()) {
            Loggers.SRV_LOG.debug("[CLIENT-BEAT] full arguments: beat: {}, serviceName: {}", clientBeat, serviceName);
        }
        // Try to look up the instance in the Nacos registry
        Instance instance = serviceManager.getInstance(namespaceId, serviceName, clusterName, ip, port);
        // If it is not found, the instance is not registered yet
        if (instance == null) {
            if (clientBeat == null) {
                // Matches the client-side code that re-registers the instance on RESOURCE_NOT_FOUND
                result.put(CommonParams.CODE, NamingResponseCode.RESOURCE_NOT_FOUND);
                return result;
            }
            instance = new Instance();
            instance.setPort(clientBeat.getPort());
            instance.setIp(clientBeat.getIp());
            instance.setWeight(clientBeat.getWeight());
            instance.setMetadata(clientBeat.getMetadata());
            instance.setClusterName(clusterName);
            instance.setServiceName(serviceName);
            instance.setInstanceId(instance.getInstanceId());
            instance.setEphemeral(clientBeat.isEphemeral());
            // Register the instance again
            serviceManager.registerInstance(namespaceId, serviceName, instance);
        }
        // Look up the Service in the registry by namespaceId and serviceName
        Service service = serviceManager.getService(namespaceId, serviceName);
        // If the service does not exist, throw a NacosException with SERVER_ERROR
        if (service == null) {
            throw new NacosException(NacosException.SERVER_ERROR,
                "service not found: " + serviceName + "@" + namespaceId);
        }
        if (clientBeat == null) {
            clientBeat = new RsInfo();
            clientBeat.setIp(ip);
            clientBeat.setPort(port);
            clientBeat.setCluster(clusterName);
        }
        // The heartbeat is valid, so process it
        service.processClientBeat(clientBeat);
        result.put(CommonParams.CODE, NamingResponseCode.OK);
        result.put("clientBeatInterval", instance.getInstanceHeartBeatInterval());
        result.put(SwitchEntry.LIGHT_BEAT_ENABLED, switchDomain.isLightBeatEnabled());
        return result;
    }

Processing the heartbeat request

    public void processClientBeat(final RsInfo rsInfo) {
        ClientBeatProcessor clientBeatProcessor = new ClientBeatProcessor();
        clientBeatProcessor.setService(this);
        clientBeatProcessor.setRsInfo(rsInfo);
        HealthCheckReactor.scheduleNow(clientBeatProcessor);
    }

HealthCheckReactor is just a wrapper around a thread pool. The key logic is in ClientBeatProcessor, a Runnable, whose run method is:

    public void run() {
        Service service = this.service;
        if (Loggers.EVT_LOG.isDebugEnabled()) {
            Loggers.EVT_LOG.debug("[CLIENT-BEAT] processing beat: {}", rsInfo.toString());
        }
        String ip = rsInfo.getIp();
        String clusterName = rsInfo.getCluster();
        int port = rsInfo.getPort();
        // Get the cluster
        Cluster cluster = service.getClusterMap().get(clusterName);
        // Get all ephemeral instances in the cluster
        List<Instance> instances = cluster.allIPs(true);
        for (Instance instance : instances) {
            // Find the instance this heartbeat belongs to
            if (instance.getIp().equals(ip) && instance.getPort() == port) {
                if (Loggers.EVT_LOG.isDebugEnabled()) {
                    Loggers.EVT_LOG.debug("[CLIENT-BEAT] refresh beat: {}", rsInfo.toString());
                }
                // Update lastBeat, the time of the instance's most recent heartbeat.
                // lastBeat is the key value for deciding whether the heartbeat has timed out.
                instance.setLastBeat(System.currentTimeMillis());
                if (!instance.isMarked()) {
                    if (!instance.isHealthy()) {
                        instance.setHealthy(true);
                        Loggers.EVT_LOG.info("service: {} {POS} {IP-ENABLED} valid: {}:{}@{}, region: {}, msg: client beat ok",
                            cluster.getService().getName(), ip, port, cluster.getName(), UtilsAndCommons.LOCALHOST_SITE);
                        getPushService().serviceChanged(service);
                    }
                }
            }
        }
    }

Heartbeat timeout detection
When a service is registered, a Service object is always created, and its init method is called during registration:

    public void init() {
        // Start the scheduled heartbeat-check task
        HealthCheckReactor.scheduleCheck(clientBeatCheckTask);
        // Iterate over the clusters of this service
        for (Map.Entry<String, Cluster> entry : clusterMap.entrySet()) {
            entry.getValue().setService(this);
            // Initialize the cluster
            entry.getValue().init();
        }
    }

    // In HealthCheckReactor:
    public static void scheduleCheck(ClientBeatCheckTask task) {
        // Runs every 5000 ms, so each instance's heartbeat state is checked every 5 seconds
        // task is a Runnable
        futureMap.putIfAbsent(task.taskKey(), EXECUTOR.scheduleWithFixedDelay(task, 5000, 5000, TimeUnit.MILLISECONDS));
    }

ClientBeatCheckTask

    public void run() {
        try {
            if (!getDistroMapper().responsible(service.getName())) {
                return;
            }
            if (!getSwitchDomain().isHealthCheckEnabled()) {
                return;
            }
            // Get all ephemeral instances
            List<Instance> instances = service.allIPs(true);
            // first set health status of instances:
            for (Instance instance : instances) {
                // Has (now - last heartbeat) exceeded the heartbeat timeout (15 seconds by default)?
                if (System.currentTimeMillis() - instance.getLastBeat() > instance.getInstanceHeartBeatTimeOut()) {
                    if (!instance.isMarked()) {
                        if (instance.isHealthy()) {
                            // Timed out: mark the instance unhealthy (healthy = false)
                            instance.setHealthy(false);
                            Loggers.EVT_LOG.info("{POS} {IP-DISABLED} valid: {}:{}@{}@{}, region: {}, msg: client timeout after {}, last beat: {}",
                                instance.getIp(), instance.getPort(), instance.getClusterName(), service.getName(),
                                UtilsAndCommons.LOCALHOST_SITE, instance.getInstanceHeartBeatTimeOut(), instance.getLastBeat());
                            // Publish the instance state change
                            getPushService().serviceChanged(service);
                            SpringContext.getAppContext().publishEvent(new InstanceHeartbeatTimeoutEvent(this, instance));
                        }
                    }
                }
            }
            if (!getGlobalConfig().isExpireInstance()) {
                return;
            }
            // then remove obsolete instances:
            for (Instance instance : instances) {
                if (instance.isMarked()) {
                    continue;
                }
                // Has (now - last heartbeat) exceeded the instance delete timeout (30 seconds by default)?
                if (System.currentTimeMillis() - instance.getLastBeat() > instance.getIpDeleteTimeout()) {
                    // delete instance
                    Loggers.SRV_LOG.info("[AUTO-DELETE-IP] service: {}, ip: {}", service.getName(), JSON.toJSONString(instance));
                    // Past the delete timeout: remove the instance
                    deleteIP(instance);
                }
            }
        } catch (Exception e) {
            Loggers.SRV_LOG.warn("Exception while processing client beat time out.", e);
        }
    }
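The two thresholds in this task reduce to a comparison against the time since the last heartbeat. The sketch below is a simplification, not the Nacos code: the class, enum, and constant names are hypothetical, the 15 and 30 second values are the defaults mentioned in the comments above, and the isMarked and isExpireInstance switches are left out.

```java
final class BeatTimeoutSketch {
    enum State { HEALTHY, UNHEALTHY, REMOVED }

    // Defaults quoted in the comments above; in Nacos they can differ per instance.
    static final long HEART_BEAT_TIMEOUT_MS = 15_000;
    static final long IP_DELETE_TIMEOUT_MS = 30_000;

    /** Classifies an instance by how long it has been silent. */
    static State classify(long nowMillis, long lastBeatMillis) {
        long silence = nowMillis - lastBeatMillis;
        if (silence > IP_DELETE_TIMEOUT_MS) {
            return State.REMOVED;
        }
        if (silence > HEART_BEAT_TIMEOUT_MS) {
            return State.UNHEALTHY;
        }
        return State.HEALTHY;
    }
}
```

Both comparisons are strict, as in the original code, so an instance that has been silent for exactly 15 seconds still counts as healthy. Because the check itself only runs every 5 seconds, the actual state change can lag the threshold by up to one check interval.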

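Active health checks
For non-ephemeral (persistent) instances, Nacos uses active health checks: it periodically sends requests to the instance and judges its health from the response.
The entry point is the registerInstance method of the ServiceManager class.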
Now look at the init method of Cluster, the one called through entry.getValue().init() in Service.init above:

    public void init() {
        if (inited) {
            return;
        }
        // Create the health-check task
        checkTask = new HealthCheckTask(this);
        // This starts the scheduled health check for non-ephemeral instances
        HealthCheckReactor.scheduleCheck(checkTask);
        inited = true;
    }

Like the Service init method above, this creates a task, here a HealthCheckTask, and hands it to the thread pool for scheduled checks. Its run method:

    public void run() {
        try {
            if (distroMapper.responsible(cluster.getService().getName()) &&
                switchDomain.isHealthCheckEnabled(cluster.getService().getName())) {
                // Run the health check
                healthCheckProcessor.process(this);
                // Log it
                if (Loggers.EVT_LOG.isDebugEnabled()) {
                    Loggers.EVT_LOG.debug("[HEALTH-CHECK] schedule health check task: {}", cluster.getService().getName());
                }
            }
        } catch (Throwable e) {
            Loggers.SRV_LOG.error("[HEALTH-CHECK] error while process health check for {}:{}",
                cluster.getService().getName(), cluster.getName(), e);
        } finally {
            if (!cancelled) {
                // When done, schedule the task again so that it runs after a delay
                HealthCheckReactor.scheduleCheck(this);
                // worst == 0 means never checked
                if (this.getCheckRTWorst() > 0
                    && switchDomain.isHealthCheckEnabled(cluster.getService().getName())
                    && distroMapper.responsible(cluster.getService().getName())) {
                    // TLog doesn't support float so we must convert it into long
                    long diff = ((this.getCheckRTLast() - this.getCheckRTLastLast()) * 10000)
                        / this.getCheckRTLastLast();
                    this.setCheckRTLastLast(this.getCheckRTLast());
                    Cluster cluster = this.getCluster();
                    if (Loggers.CHECK_RT.isDebugEnabled()) {
                        Loggers.CHECK_RT.debug("{}:{}@{}->normalized: {}, worst: {}, best: {}, last: {}, diff: {}",
                            cluster.getService().getName(), cluster.getName(), cluster.getHealthChecker().getType(),
                            this.getCheckRTNormalized(), this.getCheckRTWorst(), this.getCheckRTBest(),
                            this.getCheckRTLast(), diff);
                    }
                }
            }
        }
    }

The health-check logic lives in healthCheckProcessor.process(this). HealthCheckProcessor is an interface, and its default implementation is TcpSuperSenseProcessor:

    public void process(HealthCheckTask task) {
        // Get all non-ephemeral instances
        List<Instance> ips = task.getCluster().allIPs(false);
        if (CollectionUtils.isEmpty(ips)) {
            return;
        }
        for (Instance ip : ips) {
            if (ip.isMarked()) {
                if (SRV_LOG.isDebugEnabled()) {
                    SRV_LOG.debug("tcp check, ip is marked as to skip health check, ip:" + ip.getIp());
                }
                continue;
            }
            if (!ip.markChecking()) {
                SRV_LOG.warn("tcp check started before last one finished, service: "
                    + task.getCluster().getService().getName() + ":"
                    + task.getCluster().getName() + ":"
                    + ip.getIp() + ":"
                    + ip.getPort());
                healthCheckCommon.reEvaluateCheckRT(task.getCheckRTNormalized() * 2, task, switchDomain.getTcpHealthParams());
                continue;
            }
            // Wrap the health-check information in a Beat
            Beat beat = new Beat(ip, task);
            // Put it on a blocking queue
            taskQueue.add(beat);
            MetricsMonitor.getTcpHealthCheckMonitor().incrementAndGet();
        }
    }

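Nacos uses this pattern in many places: instead of executing the work immediately, it puts it on a blocking queue and processes it asynchronously.
Because TcpSuperSenseProcessor is a Runnable, we can go straight to its run method: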

    public void run() {
        while (true) {
            try {
                // Process queued tasks
                processTask();
                int readyCount = selector.selectNow();
                if (readyCount <= 0) {
                    continue;
                }
                Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
                while (iter.hasNext()) {
                    SelectionKey key = iter.next();
                    iter.remove();
                    NIO_EXECUTOR.execute(new PostProcessor(key));
                }
            } catch (Throwable e) {
                SRV_LOG.error("[HEALTH-CHECK] error while processing NIO task", e);
            }
        }
    }

    private void processTask() throws Exception {
        Collection<Callable<Void>> tasks = new LinkedList<>();
        do {
            // Take a beat from the queue
            Beat beat = taskQueue.poll(CONNECT_TIMEOUT_MS / 2, TimeUnit.MILLISECONDS);
            if (beat == null) {
                return;
            }
            // Wrap the beat in a TaskProcessor and add it to the batch
            tasks.add(new TaskProcessor(beat));
        } while (taskQueue.size() > 0 && tasks.size() < NIO_THREAD_COUNT * 64);
        // Run the whole batch
        for (Future<?> f : NIO_EXECUTOR.invokeAll(tasks)) {
            f.get();
        }
    }
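The queue-then-batch pattern in processTask can be sketched on its own. This is an illustration under assumptions, not the Nacos code: QueueBatchSketch and drainAndRun are hypothetical names, and the tasks return integers so the effect of a batch can be observed. Unlike the original, which returns early on an empty poll, the sketch breaks out of the loop and still runs whatever it has collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class QueueBatchSketch {
    /**
     * Waits up to pollTimeoutMillis for each task, collects at most maxBatch
     * queued tasks, runs them together, and returns the sum of their results.
     */
    static int drainAndRun(BlockingQueue<Callable<Integer>> queue, ExecutorService pool,
                           int maxBatch, long pollTimeoutMillis) {
        List<Callable<Integer>> batch = new ArrayList<>();
        try {
            do {
                Callable<Integer> task = queue.poll(pollTimeoutMillis, TimeUnit.MILLISECONDS);
                if (task == null) {
                    break; // nothing arrived in time
                }
                batch.add(task);
            } while (!queue.isEmpty() && batch.size() < maxBatch);

            int sum = 0;
            // invokeAll blocks until every task in the batch has completed.
            for (Future<Integer> f : pool.invokeAll(batch)) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

The timed poll keeps the consumer from spinning on an empty queue, while the batch cap bounds how long a single round can take before the caller gets back to its selector loop.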

Next, look at TaskProcessor. It is a Callable, so go straight to its call method:

        public Void call() {
            // How long this check task has already been waiting
            long waited = System.currentTimeMillis() - beat.getStartTime();
            if (waited > MAX_WAIT_TIME_MILLISECONDS) {
                Loggers.SRV_LOG.warn("beat task waited too long: " + waited + "ms");
            }
            SocketChannel channel = null;
            try {
                // Get the instance information
                Instance instance = beat.getIp();
                Cluster cluster = beat.getTask().getCluster();
                BeatKey beatKey = keyMap.get(beat.toString());
                if (beatKey != null && beatKey.key.isValid()) {
                    if (System.currentTimeMillis() - beatKey.birthTime < TCP_KEEP_ALIVE_MILLIS) {
                        instance.setBeingChecked(false);
                        return null;
                    }
                    beatKey.key.cancel();
                    beatKey.key.channel().close();
                }
                // Open a TCP connection through NIO
                channel = SocketChannel.open();
                channel.configureBlocking(false);
                // only by setting this can we make the socket close event asynchronous
                channel.socket().setSoLinger(false, -1);
                channel.socket().setReuseAddress(true);
                channel.socket().setKeepAlive(true);
                channel.socket().setTcpNoDelay(true);
                int port = cluster.isUseIPPort4Check() ? instance.getPort() : cluster.getDefCkport();
                channel.connect(new InetSocketAddress(instance.getIp(), port));
                // Register for connect and read events
                SelectionKey key = channel.register(selector, SelectionKey.OP_CONNECT | SelectionKey.OP_READ);
                key.attach(beat);
                keyMap.put(beat.toString(), new BeatKey(key));
                beat.setStartTime(System.currentTimeMillis());
                NIO_EXECUTOR.schedule(new TimeOutTask(key), CONNECT_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                beat.finishCheck(false, false, switchDomain.getTcpHealthParams().getMax(), "tcp:error:" + e.getMessage());
                if (channel != null) {
                    try {
                        channel.close();
                    } catch (Exception ignore) {
                    }
                }
            }
            return null;
        }

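Nacos health checking has two modes:

  1. Ephemeral instances:
    Use client-side heartbeats, sent every 5 seconds
    Marked unhealthy once no heartbeat has arrived for more than 15 seconds
    Removed from the service list once no heartbeat has arrived for more than 30 seconds
  2. Persistent instances:
    Use active health checks initiated by the server
    The check period is 2000 ms plus a random value within 5000 ms
    A failed check only marks the instance unhealthy; the instance is never removed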
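The randomized period for persistent instances can be pictured with a short sketch. It follows only the description in the list above (a 2000 ms base plus a random value below 5000 ms); the class and constant names are hypothetical, and the exact formula used inside Nacos may differ.

```java
import java.util.concurrent.ThreadLocalRandom;

final class CheckPeriodSketch {
    static final long BASE_MS = 2000;
    static final long JITTER_BOUND_MS = 5000;

    /** Returns a delay in [2000, 7000) milliseconds. */
    static long nextDelayMillis() {
        return BASE_MS + ThreadLocalRandom.current().nextLong(JITTER_BOUND_MS);
    }
}
```

Adding random jitter to a fixed base spreads the probes out over time, so checks for many instances do not all fire at the same moment.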

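Take Taobao as an example: during the Double 11 sale, traffic is far higher than usual, so services need many more instances to handle the load, and those instances are no longer needed once the sale is over. Ephemeral instances fit this case well. For a service's permanent baseline instances, persistent instances are the better fit.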

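Compared with Eureka: both handle ephemeral instances with heartbeats, and the differences are small. The main one is the heartbeat period, which is 30 seconds in Eureka and 5 seconds in Nacos.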

In addition, Nacos supports persistent instances and Eureka does not: Eureka offers only heartbeat-based health checking, with no active probing.
