基础模式 健康探针
基础模式 健康探针
2.3 Health Probe
2.3 健康探针
The Health Probe pattern is about how an application can communicate its health state to Kubernetes. To be fully automatable, a cloud-native application must be highly observable by allowing its state to be inferred so that Kubernetes can detect whether the application is up and whether it is ready to serve requests. These observations influence the lifecycle management of Pods and the way traffic is routed to the application.
Health Probe模式是关于应用程序如何将其健康状态传递给Kubernetes的。为了完全自动化,云原生应用程序必须具有高度的可观察性,允许推断其状态,以便Kubernetes能够检测应用程序是否已启动以及是否已准备好为请求提供服务。这些观察结果会影响POD的生命周期管理以及将流量路由到应用程序的方式。
2.3.1 Problem
2.3.1 问题
Kubernetes regularly checks the container process status and restarts it if issues are detected. However, from practice, we know that checking the process status is not sufficient to decide about the health of an application. In many cases, an application hangs, but its process is still up and running. For example, a Java application may throw an OutOfMemoryError and still have the JVM process running. Alternatively, an application may freeze because it runs into an infinite loop, deadlock, or some thrashing (cache, heap, process). To detect these kinds of situations, Kubernetes needs a reliable way to check the health of applications. That is, not to understand how an application works internally, but a check that indicates whether the application is functioning as expected and capable of serving consumers.
Kubernetes定期检查容器进程状态,并在检测到问题时重新启动。然而,从实践中我们知道,检查进程状态不足以决定应用程序的运行状况。在许多情况下,应用程序挂起,但其进程仍在运行。例如,Java应用程序可能抛出OutOfMemoryError,并且JVM进程仍在运行。或者,应用程序可能会冻结,因为它运行在无限循环、死锁或某些振荡(缓存、堆、进程)中。为了检测这类情况,Kubernetes需要一种可靠的方法来检查应用程序的运行状况。也就是说,不是为了了解应用程序在内部是如何工作的,而是为了检查应用程序是否按预期运行并能够为消费者服务。
2.3.2 Solution
2.3.2 解决方案
The software industry has accepted the fact that it is not possible to write bug-free code. Moreover, the chances for failure increase even more when working with distributed applications. As a result, the focus for dealing with failures has shifted from avoiding them to detecting faults and recovering. Detecting failure is not a simple task that can be performed uniformly for all applications, as all have different definitions of a failure. Also, various types of failures require different corrective actions.Transient failures may self-recover, given enough time, and some other failures may need a restart of the application. Let’s see the checks Kubernetes uses to detect and correct failures.
软件行业已经接受了这样一个事实,即不可能编写没有bug的代码。此外,在使用分布式应用程序时,出现故障的机会会增加得更多。因此,处理故障的重点已从避免故障转移到检测故障和恢复。检测故障并不是一项可以对所有应用程序统一执行的简单任务,因为所有应用程序都有不同的故障定义。此外,各种类型的故障需要不同的纠正措施。如果有足够的时间,瞬时故障可能会自我恢复,而其他一些故障可能需要重新启动应用程序。让我们看看Kubernetes用来检测和纠正故障的检查。
2.3.2.1 Process Health Checks
2.3.2.1 进程状态检测
A process health check is the simplest health check the Kubelet constantly performs on the container processes. If the container processes are not running, the container is restarted. So even without any other health checks, the application becomes slightly more robust with this generic check. If your application is capable of detecting any kind of failure and shutting itself down, the process health check is all you need. However, for most cases that is not enough and other types of health checks are also necessary.
进程运行状况检查是Kubelet经常对容器进程执行的最简单的运行状况检查。如果容器进程未运行,则重新启动容器。因此,即使没有任何其他健康检查,应用程序也会通过这种通用检查变得更加健壮。如果您的应用程序能够检测到任何类型的故障并自行关闭,那么您只需要进行流程健康检查。然而,在大多数情况下,这还不够,还需要进行其他类型的健康检查。
2.3.2.2 Liveness Probes
2.3.2.2 存活探针
If your application runs into some deadlock, it is still considered healthy from the process health check’s point of view. To detect this kind of issue and any other types of failure according to your application business logic, Kubernetes has liveness proberegular checks performed by the Kubelet agent that asks your container to confirm it is still healthy. It is important to have the health check performed from the outside rather than the in application itself, as some failures may prevent the application watchdog from reporting its failure. Regarding corrective action, this health check is similar to a process health check, since if a failure is detected, the container is restarted. However, it offers more flexibility regarding what methods to use for checking the application health, as follows:
如果你的应用程序运行中遇到一些死锁,从进程健康检查的角度来看,它仍然被认为是正常的。为了检测到这类问题以及相似的其他造成应用测序业务逻辑失效的情况,Kubernetes提供了由Kubelet代理执行的活动常规检查,该代理会要求容器确认其是否仍然正常。从外部而不是从应用程序本身执行运行状况检查非常重要,因为某些故障可能会阻止应用程序监视程序报告其故障。关于纠正措施,此健康检查类似于进程健康检查,因为如果检测到故障,容器将重新启动。但是,它在检查应用程序运行状况的方法方面提供了更大的灵活性,如下所示:
- HTTP probe performs an HTTP GET request to the container IP address and expects a successful HTTP response code between 200 and 399.
- HTTP 探针对容器IP地址执行HTTP GET请求,并期望成功的HTTP响应代码介于200和399之间。
- A TCP Socket probe assumes a successful TCP connection.
- TCP套接字探针假定TCP连接成功。
- An Exec probe executes an arbitrary command in the container kernel namespace and expects a successful exit code (0).
- 执行探针在容器内核命名空间中执行任意命令,并期望成功退出代码(0)。
An example HTTP-based liveness probe is shown in Example 4-1.
Example 4-1. Container with a liveness probe
apiVersion: v1
kind: Pod
metadata:
name: pod-with-liveness-check
spec:
containers:
- image: k8spatterns/random-generator:1.0
name: random-generator
env:
- name: DELAY_STARTUP
value: "20"
ports:
- containerPort: 8080
protocol: TCP
livenessProbe:
httpGet: #HTTP probe to a health-check endpoint
path: /actuator/health
port: 8080
initialDelaySeconds: 30
#Wait 30 seconds before doing the first liveness check to give the application some time to warm up
Depending on the nature of your application, you can choose the method that is most suitable for you. It is up to your implementation to decide when your application is considered healthy or not. However, keep in mind that the result of not passing a health check is restarting of your container. If restarting your container does not help, there is no benefit to having a failing health check as Kubernetes restarts your container without fixing the underlying issue.
根据应用程序的性质,你可以选择最适合自己的方法。什么时候应用程序被认为是健康的,这取决于你的实现。但是,请记住,未通过运行状况检查的结果是重新启动容器。如果重新启动容器没有帮助,那么运行状况检查失败没有任何好处,因为Kubernetes在没有修复潜在问题的情况下重新启动容器。
2.3.2.3 Readiness Probes
2.3.2.3 就绪探针
Liveness checks are useful for keeping applications healthy by killing unhealthy containers and replacing them with new ones. But sometimes a container may not be healthy, and restarting it may not help either. The most common example is when a container is still starting up and not ready to handle any requests yet. Or maybe a container is overloaded, and its latency is increasing, and you want it to shield itself from additional load for a while.
活动性探针有助于通过消除不健康的容器并用新容器替换它们来保持应用程序的健康。但有时容器可能不健康,重新启动也可能无济于事。最常见的例子是当容器仍在启动且尚未准备好处理任何请求时。或者一个容器过载了,它的延迟在增加,你希望它在一段时间内避免额外的负载。
For this kind of scenario, Kubernetes has readiness probes. The methods for performing readiness checks are the same as liveness checks (HTTP, TCP, Exec), but the corrective action is different. Rather than restarting the container, a failed readiness probe causes the container to be removed from the service endpoint and not receive any new traffic. Readiness probes signal when a container is ready so that it has some time to warm up before getting hit with requests from the service. It is also useful for shielding the container from traffic at later stages, as readiness probes are performed regularly, similarly to liveness checks. Example 4-2 shows how a readiness probe can be implemented by probing the existence of a file the application creates when it is ready for operations.
对于这种情况,Kubernetes提供了就绪探针。执行就绪性检查的方法与活动性检查(HTTP、TCP、Exec)相同,但纠正措施不同。失败的就绪性探测不会重新启动容器,而是导致容器从服务端点移除,并且不会接收任何新流量。就绪探针在容器准备就绪时发出信号,以便在收到来自服务的请求之前有时间预热。它还可用于在后期保护容器不受流量影响,因为就绪探针是定期执行的,类似于活动性检查。示例4-2显示了如何通过探测应用程序在准备运行时创建的文件的存在性来实现就绪探测。
Example 4-2. Container with readiness probe
apiVersion: v1
kind: Pod
metadata:
name: pod-with-readiness-check
spec:
containers:
- image: k8spatterns/random-generator:1.0
name: random-generator
readinessProbe:
exec:
command: [ "stat", "/var/run/random-generator-ready" ]
#Check for the existence of a file the application creates to indicate it’s ready to serve requests. stat returns an error if the file does not exist, letting the readiness check fail.
Again, it is up to your implementation of the health check to decide when your application is ready to do its job and when it should be left alone. While process health checks and liveness checks are intended to recover from the failure by restarting the container, the readiness check buys time for your application and expects it to recover by itself. Keep in mind that Kubernetes tries to prevent your container from receiving new requests (when it is shutting down, for example), regardless of whether the readiness check still passes after having received a SIGTERM signal.
同样,健康检查实现来决定应用程序何时准备好完成其工作以及何时应该让它单独运行。虽然进程运行状况检查和活动性检查旨在通过重新启动容器从故障中恢复,但就绪性检查为应用程序赢得了时间,并期望它能够自行恢复。请记住,Kubernetes试图阻止容器接收新请求(例如,当它关闭时),而不管在接收到SIGTERM信号后就绪性检查是否仍然通过。
In many cases, you have liveness and readiness probes performing the same checks.However, the presence of a readiness probe gives your container time to start up.Only by passing the readiness check is a Deployment considered to be successful, so that, for example, Pods with an older version can be terminated as part of a rolling update.
在许多情况下,活动性和就绪性探针执行相同的检查。但是,就绪探针的存在为容器提供了启动时间。只有通过就绪性检查,部署才会被视为成功,因此,例如,具有较旧版本的POD可以作为滚动更新的一部分终止。
The liveness and readiness probes are fundamental building blocks in the automation of cloud-native applications. Application frameworks such as Spring actuator, WildFly Swarm health check, Karaf health checks, or the MicroProfile spec for Java provide implementations for offering Health Probes.
存活探针和就绪探针是云原生应用程序自动化的基本构建块。Spring actuator、WildFly Swarm health check、Karaf health checks或MicroProfile spec for Java等应用程序框架提供了提供健康探针的实现。
2.3.3 Discussion
2.3.3 讨论
To be fully automatable, cloud-native applications must be highly observable by providing a means for the managing platform to read and interpret the application health, and if necessary, take corrective actions. Health checks play a fundamental role in the automation of activities such as deployment, self-healing, scaling, and others. However, there are also other means through which your application can provide more visibility about its health.
要实现完全自动化,云原生应用程序必须具有高度的可观察性,所以要为管理平台提供一种读取和解释应用程序运行状况的方法,并在必要时采取纠正措施。运行状况检查在自动化活动(如部署、自我修复、扩展等)中起着基础性作用。但是,应用程序还可以通过其他方式提供有关其运行状况的更多可见性。
The obvious and old method for this purpose is through logging. It is a good practice for containers to log any significant events to system out and system error and have these logs collected to a central location for further analysis. Logs are not typically used for taking automated actions, but rather to raise alerts and further investigations.A more useful aspect of logs is the postmortem analysis of failures and detecting unnoticeable errors.
为了达到这一目的,最明显也是最古老的方法是通过日志。容器将任何重大事件记录到system out和system error中,并将这些日志收集到中心位置以供进一步分析,这是一个很好的做法。日志通常不用于采取自动操作,而是用于发出警报和进一步调查。日志的一个更有用的方面是对故障进行事后分析和检测不可见的错误。
Apart from logging to standard streams, it is also a good practice to log the reason for exiting a container to /dev/termination-log. This location is the place where the container can state its last will before being permanently vanished. Figure 4-1 shows the possible options for how a container can communicate with the runtime platform.
除了记录到标准流之外,将退出容器的原因记录到/dev/termination log也是一种很好的做法。这个位置是容器在永久消失之前可以陈述其最后日志的地方。图4-1显示了容器如何与运行时平台通信的可能选项。
Figure 4-1. Container observability options
Containers provide a unified way for packaging and running applications by treating them like black boxes. However, any container that is aiming to become a cloudnative citizen must provide APIs for the runtime environment to observe the container health and act accordingly. This support is a fundamental prerequisite for automation of the container updates and lifecycle in a unified way, which in turn improves the system’s resilience and user experience. In practical terms, that means, as a very minimum, your containerized application must provide APIs for the different kinds of health checks (liveness and readiness).
容器通过将应用程序视为黑匣子来提供打包和运行应用程序的统一方式。但是,任何旨在成为云原生公民的容器都必须为运行时环境提供API,以观察容器的运行状况并相应地采取行动。这种支持是以统一的方式自动化容器更新和生命周期的基本前提,这反过来又提高了系统的弹性和用户体验。实际上,这意味着,容器化应用程序至少必须为不同类型的健康检查(活动性和就绪性)提供API。
Even better-behaving applications must also provide other means for the managing platform to observe the state of the containerized application by integrating with tracing and metrics-gathering libraries such as OpenTracing or Prometheus. Treat your application as a black box, but implement all the necessary APIs to help the platform observe and manage your application in the best way possible.
即使性能更好的应用程序也必须为管理平台提供其他手段,以便通过与跟踪和度量收集库(如OpenTracing或Prometheus)集成来观察容器化应用程序的状态。将应用程序视为一个黑盒子,但要实现所有必要的API,以帮助平台以最佳方式观察和管理应用程序。
The next pattern, Managed Lifecycle, is also about communication between applications and the Kubernetes management layer, but coming from the other direction. It’s about how your application gets informed about important Pod lifecycle events.
下一个模式,托管生命周期,也是关于应用程序和Kubernetes管理层之间的通信,但来自另一个方向。它是关于应用程序如何获得有关重要Pod生命周期事件的信息。