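Using healthcheck to Wait for Services to Become Ready

Why healthcheck is needed

In a microservice architecture, there is a gap between the moment a service's container starts and the moment the service is actually ready to handle requests. For example:

Database: after the container starts, MySQL/PostgreSQL may still be initializing
Web service: it may still be loading configuration or connecting to the database
Cache: Redis/Aerospike may need to load data after starting

Without a healthcheck, containers that depend on these services try to connect before they are ready, which leads to connection failures and application crashes.

How healthcheck works

A Docker healthcheck determines whether a service is healthy by periodically running a specified command inside the container: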

services:
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
      start_period: 10s

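Command execution mechanism

The Docker daemon periodically runs the command given in test inside the container:

Exit code 0: the container is marked healthy
Non-zero exit code: counted as one failure (a check that runs longer than timeout also counts as a failure)
Consecutive failures reach retries: the container is marked unhealthy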
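
The counting rule above can be sketched as a small shell function. This is a simplified model for illustration only, not Docker's actual implementation: the function name simulate_health is made up here, the exit codes are passed in as arguments instead of coming from a real probe, and start_period is ignored.

```shell
#!/bin/sh
# Simplified model of the failing-streak logic (an assumption-laden sketch).
# Usage: simulate_health RETRIES EXIT_CODE...
# Prints the modeled status after each simulated check.
simulate_health() {
  retries=$1
  shift
  streak=0
  status=starting
  for code in "$@"; do
    if [ "$code" -eq 0 ]; then
      streak=0              # a success resets the failing streak
      status=healthy
    else
      streak=$((streak + 1))
      if [ "$streak" -ge "$retries" ]; then
        status=unhealthy    # retries consecutive failures reached
      fi
    fi
    echo "check exit=$code streak=$streak status=$status"
  done
}
```

For example, simulate_health 3 0 1 1 0 1 1 1 stays healthy through the first two failures because the success in between resets the streak, and only reports unhealthy after the final three failures in a row.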

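healthcheck states

A container with a healthcheck is in one of three health states (starting, healthy, unhealthy), which docker ps shows in the STATUS column:

# View the health status
docker ps

# Example output
# CONTAINER ID   IMAGE           STATUS                            NAMES
# abc123         postgres:15     Up 2 minutes (healthy)            db
# def456         myapp           Up 2 minutes (health: starting)   app
# ghi789         nginx           Up 2 minutes (unhealthy)          web

You can also use inspect to see the details: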

docker inspect --format='{{json .State.Health}}' container_name

{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2024-01-01T00:00:00Z",
      "End": "2024-01-01T00:00:01Z",
      "ExitCode": 0,
      "Output": "/var/run/postgresql:5432 - accepting connections\n"
    }
  ]
}
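
The same status field can be polled from a script, for example in a CI job that has to block until a container is ready. The following is a minimal sketch: the function name wait_for_healthy and the 60-second default are assumptions chosen for illustration, and a container without a healthcheck simply runs into the timeout because the Health field is absent.

```shell
#!/bin/sh
# wait_for_healthy CONTAINER [TIMEOUT_SECONDS]
# Polls the health status once per second until it reads "healthy".
# Returns 0 on healthy, 1 on unhealthy or timeout.
wait_for_healthy() {
  container=$1
  limit=${2:-60}          # assumed default, adjust as needed
  elapsed=0
  while [ "$elapsed" -lt "$limit" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    case "$status" in
      healthy)
        return 0 ;;
      unhealthy)
        echo "$container is unhealthy" >&2
        return 1 ;;
    esac
    sleep 1                # still starting (or no status yet): keep polling
    elapsed=$((elapsed + 1))
  done
  echo "timed out waiting for $container" >&2
  return 1
}
```

A call such as wait_for_healthy db 120 && ./run-tests.sh (the script name is a placeholder) would then start the tests only once the db container reports healthy. Recent Compose releases also provide docker compose up --wait, which covers the common case without custom scripting.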

Using healthcheck in Compose

Basic configuration

services:
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s

healthcheck examples for common services

services:
  # PostgreSQL
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER:-postgres}"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  # MySQL
  mysql:
    image: mysql:8.0
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  # Redis
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

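  # Nginx (nginx -t only validates the configuration, so probe the HTTP port instead)
  nginx:
    image: nginx:alpine
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1/ || exit 1"]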
      interval: 30s
      timeout: 10s
      retries: 3

  # Custom HTTP healthcheck
  app:
    image: myapp
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

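The wait-for-it script

If depends_on conditions are not available (for example, in a version 3 Compose file processed by the legacy docker-compose V1, where the condition form of depends_on is not supported), you can use the wait-for-it script instead: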

services:
  app:
    image: myapp
    depends_on:
      - db
    command: ["./wait-for-it.sh", "db:5432", "--", "npm", "start"]

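The wait-for-it.sh script polls until the given TCP port accepts connections, then runs the command that follows the -- separator. Note that an open port only shows that the server is listening, not that it has finished initializing. By default the script also stops waiting after 15 seconds and still runs the command unless --strict is passed.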
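
The polling idea behind such a script can be sketched in a few lines. This is not the actual wait-for-it.sh; it is a minimal illustration that assumes nc is installed in the image, and the function name wait_for_port and its 15-second default are chosen here for the example.

```shell
#!/bin/sh
# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Retries a TCP connection once per second until it succeeds or the
# timeout elapses. Returns 0 when the port is reachable, 1 on timeout.
wait_for_port() {
  host=$1
  port=$2
  timeout=${3:-15}        # assumed default for this sketch
  waited=0
  until nc -z "$host" "$port" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timeout: $host:$port not reachable after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

In an entrypoint script this could be used as wait_for_port db 5432 30 && exec npm start, which refuses to start the application if the database port never opens.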

Waiting with depends_on conditions

services:
  app:
    build: .
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy
      migration:
        condition: service_completed_successfully

  db:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 30s

  cache:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s

  migration:
    image: myapp-migrate
    depends_on:
      db:
        condition: service_healthy

Custom health check scripts

For more complex health check logic, you can write a script:

services:
  app:
    image: myapp
    healthcheck:
      test: ["CMD", "bash", "/healthcheck.sh"]
      interval: 30s

healthcheck.sh:

#!/bin/bash
# Check whether the application is really ready
# (bash, nc, and curl must be installed in the image)

# 1. Check that the port is listening
if ! nc -z localhost 3000; then
  exit 1
fi

# 2. Check the health API
status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/health)
if [ "$status" != "200" ]; then
  exit 1
fi

# 3. Check the database connection
if ! curl -s http://localhost:3000/health/db | grep -q "ok"; then
  exit 1
fi

exit 0

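Interview points

Q: What does the start_period parameter of healthcheck do?

A: start_period defines an initialization window after the container starts. Failed health checks during this window do not count toward retries. If a check succeeds during the window, the container is considered started and later failures count normally. This leaves a buffer for application initialization (such as database migrations or warm-up) so that temporary unavailability during startup does not mark the container unhealthy.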
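
To get a feel for how these parameters add up, the following sketch estimates how long a container that never becomes ready can take to be marked unhealthy. It is a rough upper estimate only, under the assumptions that every probe runs for the full timeout and that the next probe starts one interval after the previous one finishes. The function name is made up for this example.

```shell
#!/bin/sh
# max_seconds_until_unhealthy START_PERIOD INTERVAL TIMEOUT RETRIES
# Rough estimate: the start period, plus RETRIES failing probes that each
# take up to INTERVAL + TIMEOUT seconds. All values are whole seconds.
max_seconds_until_unhealthy() {
  start_period=$1
  interval=$2
  timeout=$3
  retries=$4
  echo $(( start_period + retries * (interval + timeout) ))
}
```

With the db settings from the depends_on example (start_period 30s, interval 5s, timeout 5s, retries 10), max_seconds_until_unhealthy 30 5 5 10 prints 130, so a dependent service waiting on service_healthy may wait roughly two minutes before the failure is reported.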

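Q: What happens after a container goes from healthy to unhealthy?

A: With docker compose, only the container's status flag changes; the container is not restarted automatically. In swarm mode, the orchestrator automatically stops the unhealthy container and creates a replacement.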
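
Outside swarm mode, one possible workaround is a small watchdog that restarts unhealthy containers itself. The sketch below is an illustration rather than a built-in Compose feature: the function name restart_unhealthy is an assumption, and it would have to be run periodically (for example from cron) on the Docker host.

```shell
#!/bin/sh
# restart_unhealthy
# Lists running containers whose health status is "unhealthy"
# and restarts each of them by name.
restart_unhealthy() {
  docker ps --filter health=unhealthy --format '{{.Names}}' |
  while read -r name; do
    [ -n "$name" ] || continue     # skip empty lines
    echo "restarting $name"
    docker restart "$name" >/dev/null
  done
}
```

Restarting blindly can hide the underlying problem, so it is worth checking the Log entries under .State.Health before automating this.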

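Q: Does healthcheck affect performance?

A: Slightly, because the Docker daemon periodically runs a command inside the container. Choose a sensible interval (30s is usually enough) and avoid checking too frequently.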

Copyright of this article belongs to the author. Do not repost without permission.
