OpenClaw Health Checks and Automatic Restart Mechanisms

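Introduction

For an OpenClaw service that runs 24/7, relying on someone to watch it manually is not realistic. You need an automated health check and recovery setup that continuously monitors service state, repairs failures automatically when it can, and raises an alert promptly when it cannot. This article covers OpenClaw's health check tools and its automatic restart mechanisms.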

1. The openclaw doctor Diagnostic Command

1.1 Basic usage

openclaw doctor is OpenClaw's built-in, all-in-one diagnostic tool. It automatically checks the key indicators of the service:

openclaw doctor

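Sample output:

OpenClaw Doctor - Health Diagnostic Report
================================

[Service Status]
  ✓ OpenClaw daemon running (PID: 12345)
  ✓ Uptime: 7d 3h 22m
  ✓ Version: 1.2.0 (latest)

[Gateway Checks]
  ✓ Gateway port 18789 reachable
  ✓ Health check endpoint responding normally (12ms)
  ✓ Web Dashboard reachable

[Channel Connections]
  ✓ WhatsApp: connected
  ✓ Telegram: connected
  ⚠ Discord: reconnecting (last disconnect: 2 minutes ago)

[Model Service]
  ✓ Claude API: available (latency: 320ms)
  ✓ API Key: valid
  ⚠ Token usage this month: 85% (approaching quota)

[Resource Usage]
  ✓ Memory: 198MB / 512MB (38%)
  ✓ CPU: 3.2% average
  ✓ Disk: log directory 256MB
  ✓ File descriptors: 128 / 65536

[Recent Errors]
  ⚠ 2 API timeouts in the past hour
  ✓ No critical errors in the past 24 hours

Result: mostly healthy (2 warnings)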

1.2 Verbose mode

# Run all checks and print detailed information
openclaw doctor --verbose

# Check a single module only
openclaw doctor --check gateway
openclaw doctor --check channels
openclaw doctor --check model
openclaw doctor --check resources
openclaw doctor --check skills

# Output JSON (easier to process in scripts)
openclaw doctor --format json

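1.3 Using it in scripts

The exit code of openclaw doctor can be used in scripts:

Exit code	Meaning
0	Everything is healthy
1	Warnings present, but the service is usable
2	Serious problems; the service may be affected
3	Service unavailable

#!/bin/bash
# Run the diagnostics quietly and branch on the exit code
openclaw doctor --quiet
EXIT_CODE=$?

case $EXIT_CODE in
    0) echo "Service healthy" ;;
    1) echo "Warnings present, needs attention" ;;
    2) echo "Serious problem, needs handling" ;;
    3) echo "Service unavailable, fix urgently!" ;;
    *) echo "Unexpected exit code: $EXIT_CODE" ;;
esac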

2. Health Check Endpoints

2.1 HTTP health checks

The OpenClaw Gateway exposes several health check endpoints:

# Basic liveness check
curl -s http://localhost:18789/health
# {"status":"ok","uptime":604800}

# Readiness check
curl -s http://localhost:18789/health/ready
# {"status":"ready","channels":{"whatsapp":"connected","telegram":"connected"}}

# Detailed status
curl -s http://localhost:18789/health/detail
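2.2 Custom health criteria

You can define in the configuration which conditions count as "healthy":

// ~/.config/openclaw/openclaw.json5
{
  "health": {
    // At least one channel must be connected for the service to be ready
    "minConnectedChannels": 1,
    // Mark as unhealthy when memory usage exceeds this percentage
    "maxMemoryPercent": 90,
    // Mark as unhealthy when consecutive model API failures exceed this value
    "maxConsecutiveModelErrors": 5,
    // Health check timeout (milliseconds)
    "timeout": 5000
  }
}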
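
To make the thresholds above concrete, here is a minimal sketch of how the three criteria could be combined into a single verdict. It is not OpenClaw's implementation: the function name evaluate_health and its three arguments are hypothetical, and the comparisons follow the wording of the config comments ("exceeds" is read as strictly greater than).

```shell
#!/bin/bash
# Thresholds copied from the sample "health" config above
MIN_CONNECTED_CHANNELS=1
MAX_MEMORY_PERCENT=90
MAX_CONSECUTIVE_MODEL_ERRORS=5

# evaluate_health CONNECTED_CHANNELS MEMORY_PERCENT CONSECUTIVE_MODEL_ERRORS
# Prints "healthy" and returns 0, or prints the first failed criterion and returns 1.
evaluate_health() {
    local connected="$1" mem_percent="$2" model_errors="$3"
    if [ "$connected" -lt "$MIN_CONNECTED_CHANNELS" ]; then
        echo "unhealthy: ${connected} channel(s) connected"
        return 1
    fi
    if [ "$mem_percent" -gt "$MAX_MEMORY_PERCENT" ]; then
        echo "unhealthy: memory at ${mem_percent}%"
        return 1
    fi
    if [ "$model_errors" -gt "$MAX_CONSECUTIVE_MODEL_ERRORS" ]; then
        echo "unhealthy: ${model_errors} consecutive model errors"
        return 1
    fi
    echo "healthy"
}

# Two channels connected, 38% memory, no model errors
evaluate_health 2 38 0   # prints "healthy"
```

Note that values exactly at the threshold (90% memory, 5 errors) still pass under this reading; if you want them to fail, change -gt to -ge.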

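3. Watchdog Mechanisms

3.1 OpenClaw's built-in watchdog

OpenClaw ships with a watchdog that monitors its own state and recovers automatically when something goes wrong:

{
  "watchdog": {
    "enabled": true,
    // Check interval
    "interval": "60s",
    // Conditions that trigger an automatic restart
    "restartOn": {
      // Restart when memory exceeds the threshold
      "memoryExceeded": true,
      "memoryThreshold": "450MB",
      // Restart when every channel is disconnected
      "allChannelsDisconnected": true,
      // Restart after this many consecutive model failures
      "consecutiveModelErrors": 10,
      // Restart when the response timeout rate is too high
      "timeoutRateThreshold": 0.5
    },
    // Restart cooldown (avoids restarting too frequently)
    "restartCooldown": "5m",
    // Maximum number of consecutive restarts
    "maxRestarts": 3,
    // What to do once the maximum is exceeded
    "onMaxRestartsExceeded": "alert"  // "alert" or "stop"
  }
}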
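
The interaction between restartCooldown, maxRestarts and onMaxRestartsExceeded is easier to see as a decision function. The sketch below is one plausible reading of those three settings, not the built-in watchdog's actual logic; decide_restart and its arguments are hypothetical names.

```shell
#!/bin/bash
RESTART_COOLDOWN=300   # "5m" expressed in seconds
MAX_RESTARTS=3

# decide_restart NOW LAST_RESTART RESTART_COUNT
# NOW and LAST_RESTART are Unix timestamps; RESTART_COUNT is the number of
# consecutive restarts already performed. Prints one of: alert, wait, restart.
decide_restart() {
    local now="$1" last_restart="$2" restart_count="$3"
    if [ "$restart_count" -ge "$MAX_RESTARTS" ]; then
        # Restart budget used up: fall through to onMaxRestartsExceeded
        echo "alert"
    elif [ $(( now - last_restart )) -lt "$RESTART_COOLDOWN" ]; then
        # Still inside the cooldown window: do nothing yet
        echo "wait"
    else
        echo "restart"
    fi
}

decide_restart 1000 900 1   # prints "wait": only 100s since the last restart
```

Checking the restart budget before the cooldown means a service that keeps failing escalates to a human instead of waiting out one more cooldown first.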

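3.2 External watchdog script

For more reliable monitoring, run a watchdog that is independent of the OpenClaw process, so that it keeps working even when OpenClaw itself hangs:

#!/bin/bash
# /usr/local/bin/openclaw-watchdog.sh

HEALTH_URL="http://localhost:18789/health"
LOG="/var/log/openclaw-watchdog.log"
# Each cron run is a fresh process, so the failure counter is persisted in a
# file between runs (the path is an example; any writable location works).
STATE_FILE="/var/run/openclaw-watchdog.failures"
MAX_FAILURES=3
# BOT_TOKEN and CHAT_ID are not defined by this script. Define them here or
# export them before enabling alerts (cron does not load your shell profile).

# Return 0 only if the endpoint answers and reports status "ok"
check_health() {
    local response status
    response=$(curl -sf --max-time 10 "$HEALTH_URL" 2>/dev/null) || return 1
    status=$(echo "$response" | jq -r '.status' 2>/dev/null)
    [ "$status" = "ok" ]
}

log_msg() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG"
}

# Load the failure count left by the previous run (0 if there is none)
FAILURE_COUNT=$(cat "$STATE_FILE" 2>/dev/null || echo 0)

if ! check_health; then
    FAILURE_COUNT=$((FAILURE_COUNT + 1))
    log_msg "Health check failed ($FAILURE_COUNT in a row)"

    if [ "$FAILURE_COUNT" -ge "$MAX_FAILURES" ]; then
        log_msg "$MAX_FAILURES consecutive failures, restarting the service..."
        openclaw restart
        sleep 20

        if check_health; then
            log_msg "Restart succeeded, service recovered"
            FAILURE_COUNT=0
        else
            log_msg "Still unavailable after restart, sending alert"
            # Send an alert (Telegram/email/webhook)
            curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
                -d chat_id="${CHAT_ID}" \
                -d text="OpenClaw restart failed, manual intervention required! Server: $(hostname)"
        fi
    fi
else
    FAILURE_COUNT=0
fi

# Save the counter for the next run
echo "$FAILURE_COUNT" > "$STATE_FILE"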

Schedule it with cron:

chmod +x /usr/local/bin/openclaw-watchdog.sh

# Run every 2 minutes: open the crontab editor and add the line below
crontab -e
# */2 * * * * /usr/local/bin/openclaw-watchdog.sh

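3.3 Systemd watchdog integration

If OpenClaw is managed by Systemd, you can use Systemd's own watchdog feature. Note that unit files do not support trailing comments, so each comment sits on its own line, and the start rate limit settings belong in the [Unit] section:

# /etc/systemd/system/openclaw.service
[Unit]
# Allow at most 5 starts within 300 seconds
StartLimitBurst=5
StartLimitIntervalSec=300

[Service]
# Watchdog timeout: 120 seconds
WatchdogSec=120
Restart=on-failure
RestartSec=10

# Run after the service starts, and after it stops for any reason
# (including a crash or a watchdog timeout)
ExecStartPost=/usr/local/bin/openclaw-health-notify.sh start
ExecStopPost=/usr/local/bin/openclaw-health-notify.sh stop

OpenClaw must send a heartbeat to Systemd at regular intervals; if none arrives within WatchdogSec, Systemd treats the service as failed and Restart=on-failure restarts it. Enable the heartbeat in the configuration:

{
  "watchdog": {
    "systemdNotify": true  // Automatically send watchdog heartbeats to systemd
  }
}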
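
For intuition about what the heartbeat involves: when WatchdogSec= is set, systemd passes the timeout to the service in the WATCHDOG_USEC environment variable (in microseconds), and the usual convention is to send a heartbeat at half that interval. The sketch below only illustrates that arithmetic; heartbeat_interval is a hypothetical helper, and how OpenClaw schedules its own heartbeats with systemdNotify enabled is not specified here.

```shell
#!/bin/bash
# heartbeat_interval WATCHDOG_USEC
# Prints the number of seconds to wait between heartbeats: half the
# watchdog timeout, with a floor of 1 second.
heartbeat_interval() {
    local usec="$1"
    local secs=$(( usec / 2 / 1000000 ))
    if [ "$secs" -lt 1 ]; then
        secs=1
    fi
    echo "$secs"
}

# WatchdogSec=120 corresponds to WATCHDOG_USEC=120000000
heartbeat_interval 120000000   # prints "60"

# Inside a service started by systemd, a shell heartbeat loop could look like:
#   while true; do
#       systemd-notify WATCHDOG=1
#       sleep "$(heartbeat_interval "$WATCHDOG_USEC")"
#   done
```

Sending at half the timeout leaves room for one delayed heartbeat before systemd declares the service hung.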

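4. Automatic Channel Reconnection

4.1 Built-in reconnection

OpenClaw has built-in automatic reconnection logic for every channel connection:

{
  "channels": {
    "reconnect": {
      "enabled": true,
      // Initial reconnect delay
      "initialDelay": "5s",
      // Maximum reconnect delay (upper bound for exponential backoff)
      "maxDelay": "5m",
      // Maximum number of reconnect attempts (0 = unlimited)
      "maxAttempts": 0,
      // Backoff factor
      "backoffFactor": 2
    }
  }
}

Sample reconnection log:

[INFO] [channel:discord] Connection lost, retrying in 5s (1/∞)
[INFO] [channel:discord] Reconnecting...
[WARN] [channel:discord] Reconnect failed, retrying in 10s (2/∞)
[INFO] [channel:discord] Reconnecting...
[INFO] [channel:discord] Reconnected, downtime: 18s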
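
The delays in the log follow from the three backoff settings: each failed attempt multiplies the previous delay by backoffFactor until maxDelay is reached. The sketch below computes that schedule with the sample values; backoff_delay is a hypothetical helper that illustrates capped exponential backoff in general, not OpenClaw's internal code.

```shell
#!/bin/bash
INITIAL_DELAY=5     # "5s"
MAX_DELAY=300       # "5m" expressed in seconds
BACKOFF_FACTOR=2

# backoff_delay ATTEMPT
# Prints the delay in seconds before reconnect attempt ATTEMPT (starting at 1).
backoff_delay() {
    local attempt="$1" delay="$INITIAL_DELAY" i=1
    while [ "$i" -lt "$attempt" ]; do
        delay=$(( delay * BACKOFF_FACTOR ))
        # Stop growing once the cap is reached
        if [ "$delay" -ge "$MAX_DELAY" ]; then
            delay="$MAX_DELAY"
            break
        fi
        i=$(( i + 1 ))
    done
    echo "$delay"
}

# Delays for the first eight attempts: 5 10 20 40 80 160 300 300
for attempt in 1 2 3 4 5 6 7 8; do
    echo "attempt $attempt: wait $(backoff_delay "$attempt")s"
done
```

With these values the delay reaches the 5-minute cap at the seventh attempt, so a long outage settles into one reconnect attempt every five minutes.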

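4.2 Monitoring channel connection status

# Show the connection status of all channels
openclaw status

# Output
# Channel status:
#   WhatsApp  ✓ connected (uptime: 7d)
#   Telegram  ✓ connected (uptime: 7d)
#   Discord   ⚠ reconnecting (disconnected: 2m)

# Manually reconnect a specific channel
openclaw channel reconnect discord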

5. Alert Notification Configuration

5.1 Built-in alert channels

{
  "alerts": {
    "enabled": true,
    // Delivery methods
    "channels": [
      {
        "type": "telegram",
        "botToken": "YOUR_BOT_TOKEN",
        "chatId": "YOUR_CHAT_ID"
      },
      {
        "type": "email",
        "smtp": {
          "host": "smtp.example.com",
          "port": 587,
          "user": "[email protected]",
          "pass": "password"
        },
        "to": "[email protected]"
      },
      {
        "type": "webhook",
        "url": "https://hooks.slack.com/services/xxx"
      }
    ],
    // Conditions that trigger an alert
    "rules": {
      "serviceDown": true,
      "channelDisconnected": true,
      "highMemory": true,
      "highErrorRate": true,
      "certificateExpiring": true
    }
  }
}

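5.2 Alert throttling

Avoid sending a flood of duplicate alerts in a short period:

{
  "alerts": {
    // Minimum interval between alerts of the same type
    "throttle": "15m",
    // Send a notification when the service recovers
    "notifyOnRecover": true
  }
}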
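
If you send alerts from your own scripts as well, the same throttling idea is easy to reproduce: remember when each alert type was last sent and suppress repeats inside the window. The sketch below is an illustration under assumptions, not the built-in implementation; should_send_alert and the STATE_DIR layout (one timestamp file per alert type) are hypothetical.

```shell
#!/bin/bash
THROTTLE_SECONDS=900          # "15m" expressed in seconds
STATE_DIR=$(mktemp -d)        # hypothetical store: one timestamp file per alert type

# should_send_alert TYPE NOW
# Returns 0 and records NOW if an alert of TYPE may be sent,
# or returns 1 if the previous alert of that type is still inside the window.
should_send_alert() {
    local type="$1" now="$2" last
    if [ -f "$STATE_DIR/$type" ]; then
        last=$(cat "$STATE_DIR/$type")
        if [ $(( now - last )) -lt "$THROTTLE_SECONDS" ]; then
            return 1
        fi
    fi
    echo "$now" > "$STATE_DIR/$type"
    return 0
}

# Example: only the first highMemory alert within 15 minutes goes out
if should_send_alert highMemory "$(date +%s)"; then
    echo "sending highMemory alert"
fi
```

Keying the state by alert type means a highMemory storm does not delay a serviceDown alert. For a real deployment, point STATE_DIR at a fixed directory so the timestamps survive between script runs.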

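6. Health Check Best Practices

Layered checks: combine the built-in watchdog, an external cron script, and Uptime Kuma so that no single monitor is a single point of failure
Sensible intervals: health checks should not run too often; once every 1-2 minutes is enough
Progressive recovery: try reconnecting channels first, then restart the service, and only then page a human
Cooldown: set a restart cooldown to avoid a "detect failure, restart, fail again, restart again" loop
Regular drills: simulate failures on purpose (for example with kill -9) to verify that automatic recovery actually works
Keep logs: store the watchdog script's activity log separately so incidents can be reviewed afterwards

A reliable health check and automatic recovery setup can help your OpenClaw service approach 99.9% availability while significantly reducing the manual operations burden.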