OpenClaw Service Monitoring and Alerting Tutorial

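Introduction

OpenClaw runs as a long-lived AI assistant service, so its stability is critical: if the service goes down, every connected chat channel stops responding. This tutorial starts from scratch and walks through building a monitoring and alerting setup so that problems are detected and handled quickly.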

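1. Health Check Endpoints

The OpenClaw Gateway listens on port 18789 by default and exposes built-in health check endpoints.

1.1 Basic health check

# Check whether the service is alive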
curl -s http://localhost:18789/health

# Expected response
# {"status":"ok","uptime":3600,"version":"1.2.0"}
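
Because `curl -sf` only looks at the HTTP status code, a script may also want to confirm that the response body reports an "ok" status. The following is a minimal sketch, assuming the response shape shown above. The function name `health_body_ok` is hypothetical, and the pattern match is a simple stand-in for real JSON parsing:

```shell
# health_body_ok (hypothetical helper): read a /health response body on
# stdin and succeed only if it contains "status":"ok".
# Uses a plain pattern match, not a JSON parser, so it assumes the
# response shape shown in the example above.
health_body_ok() {
    grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Example invocation (not executed here):
#   curl -sf --max-time 10 http://localhost:18789/health | health_body_ok \
#       && echo "healthy" || echo "unhealthy"
```

The function returns a non-zero exit code both when the field is missing and when it holds any value other than "ok", so it can be chained directly with `&&` and `||`.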

1.2 Detailed status check

# Fetch detailed status, including each channel's connection state
curl -s http://localhost:18789/health/detail | jq .

Example response:

{
  "status": "ok",
  "uptime": 86400,
  "version": "1.2.0",
  "channels": {
    "whatsapp": "connected",
    "telegram": "connected",
    "discord": "connected"
  },
  "model": {
    "provider": "claude",
    "status": "available"
  },
  "memory": {
    "heapUsed": "128MB",
    "heapTotal": "256MB"
  }
}
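
The detailed response can be reduced to the one thing an operator usually cares about: which channels are not connected. This is a sketch that assumes the field layout in the example response above and that `jq` is installed. The function name `disconnected_channels` is hypothetical:

```shell
# disconnected_channels (hypothetical helper): read a /health/detail
# response on stdin and print the name of every channel whose state is
# not "connected", one per line. Prints nothing when all are connected.
# Requires jq; assumes the "channels" object shown in the example above.
disconnected_channels() {
    jq -r '.channels // {} | to_entries[] | select(.value != "connected") | .key'
}

# Example invocation (not executed here):
#   curl -s http://localhost:18789/health/detail | disconnected_channels
```

An empty result means every listed channel reported "connected", so a wrapper script can alert whenever the output is non-empty.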

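2. Uptime Monitoring Options

2.1 Scheduled checks with cron

The simplest approach is to check the service periodically from crontab:

# Edit the crontab
crontab -e

# Check every 5 minutes and send an email on failure
*/5 * * * * curl -sf --max-time 10 http://localhost:18789/health > /dev/null || echo "OpenClaw health check failed at $(date)" | mail -s "OpenClaw Alert" YOUR_EMAIL_ADDRESS

To restart the service automatically when a failure is detected: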

#!/bin/bash
# Save as /usr/local/bin/openclaw-watchdog.sh

HEALTH_URL="http://localhost:18789/health"
LOG_FILE="/var/log/openclaw-watchdog.log"

if ! curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
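    echo "[$(date)] Health check failed, restarting OpenClaw..." >> "$LOG_FILE"
    openclaw restart
    sleep 15
    if curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
        echo "[$(date)] Restart succeeded, service recovered" >> "$LOG_FILE"
    else
        echo "[$(date)] Service still unavailable after restart, manual intervention required!" >> "$LOG_FILE"
        # Send an urgent notification (URL-encode the text so spaces and emoji survive)
        curl -s -X POST "https://api.telegram.org/bot<TOKEN>/sendMessage" \
            -d chat_id="<CHAT_ID>" \
            --data-urlencode text="🚨 OpenClaw restart failed, manual intervention required!"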
    fi
fi

# Add to crontab to run every 3 minutes
*/3 * * * * /usr/local/bin/openclaw-watchdog.sh
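
The watchdog above restarts on the first failed check, so a single slow response can trigger a restart. One possible refinement is to restart only after several consecutive failures. This is a sketch of that idea, not part of OpenClaw: the function name `record_check`, the state file, and the threshold are all hypothetical choices:

```shell
# record_check (hypothetical helper): track consecutive health-check
# failures in a state file.
#   $1 = "ok" or "fail"   $2 = state file path   $3 = failure threshold
# Returns 0 once the number of consecutive failures reaches the
# threshold (time to restart), and 1 otherwise. A success resets the
# counter to 0.
record_check() {
    local result="$1" state_file="$2" threshold="$3" count=0
    if [ "$result" = "ok" ]; then
        echo 0 > "$state_file"
        return 1
    fi
    if [ -f "$state_file" ]; then
        count=$(cat "$state_file")
    fi
    count=$((count + 1))
    echo "$count" > "$state_file"
    [ "$count" -ge "$threshold" ]
}

# Example use inside the watchdog (not executed here):
#   if curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
#       record_check ok /var/tmp/openclaw-watchdog.count 3
#   elif record_check fail /var/tmp/openclaw-watchdog.count 3; then
#       openclaw restart
#   fi
```

With a 3-minute cron interval and a threshold of 3, a restart happens only after roughly nine minutes of continuous failure, so the threshold trades faster recovery against fewer unnecessary restarts.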

2.2 Monitoring with Uptime Kuma

Uptime Kuma is an open-source monitoring tool with a clean web interface.

# Deploy Uptime Kuma with Docker
docker run -d \
  --name uptime-kuma \
  --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  louislam/uptime-kuma:latest

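Add a monitor in Uptime Kuma with the following settings:

Setting	Value
Monitor type	HTTP(s)
Name	OpenClaw Gateway
URL	`http://<your-server-ip>:18789/health`
Heartbeat interval	60 seconds
Retries	3
Timeout	10 seconds
Expected status code	200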

3. Prometheus Metrics Collection

3.1 Enable the Prometheus endpoint

Enable Prometheus metrics export in the OpenClaw configuration file:

// ~/.config/openclaw/openclaw.json5
{
  "monitoring": {
    "prometheus": {
      "enabled": true,
      "port": 9191,
      "path": "/metrics"
    }
  }
}

Restart the service for the change to take effect:

openclaw restart

3.2 View available metrics

curl -s http://localhost:9191/metrics

Key metrics exported by OpenClaw include:

Metric	Type	Description
`openclaw_messages_total`	Counter	Total number of messages processed
`openclaw_response_duration_seconds`	Histogram	Distribution of response times
`openclaw_active_channels`	Gauge	Number of active channels
`openclaw_model_requests_total`	Counter	Number of model API calls
`openclaw_model_errors_total`	Counter	Number of model API errors
`openclaw_memory_heap_bytes`	Gauge	Node.js heap memory in use
`openclaw_skills_loaded`	Gauge	Number of loaded skills
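
For quick checks without a running Prometheus server, a single value can be pulled straight out of the text exposition format. This is a minimal sketch, assuming samples are written as `name value` or `name{labels} value` with no trailing timestamp. The function name `metric_value` is hypothetical:

```shell
# metric_value (hypothetical helper): read Prometheus text-format output
# on stdin and print the value of the first sample whose metric name
# equals $1, with or without labels. "# HELP" and "# TYPE" lines are
# skipped. Assumes samples carry no trailing timestamp.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }
        {
            metric = $1
            sub(/[{].*/, "", metric)   # drop the {label="..."} part
            if (metric == name) { print $NF; exit }
        }
    '
}

# Example invocation (not executed here):
#   curl -s http://localhost:9191/metrics | metric_value openclaw_active_channels
```

When a metric has several label combinations, only the first matching sample is printed, so this is best suited to unlabelled gauges such as `openclaw_active_channels`.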

3.3 Configure Prometheus scraping

# prometheus.yml
scrape_configs:
  - job_name: 'openclaw'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9191']
    metrics_path: '/metrics'

4. Grafana Dashboards

4.1 Install Grafana

# Install with Docker
docker run -d \
  --name grafana \
  --restart=always \
  -p 3000:3000 \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss:latest

4.2 Create the OpenClaw dashboard

After adding Prometheus as a data source in Grafana, create the following key panels:

Message processing rate panel (PromQL):

rate(openclaw_messages_total[5m])

P95 response latency panel:

histogram_quantile(0.95, sum by (le) (rate(openclaw_response_duration_seconds_bucket[5m])))

Model call error rate panel (percent):

rate(openclaw_model_errors_total[5m]) / rate(openclaw_model_requests_total[5m]) * 100

Heap memory usage panel (MiB):

openclaw_memory_heap_bytes / 1024 / 1024
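
The error rate expression divides two per-second rates over the same window, so the window length cancels out and the result is simply the share of model calls that failed. The same arithmetic can be sketched on raw counter increases. The function name `error_rate_pct` is hypothetical:

```shell
# error_rate_pct (hypothetical helper): percentage of failed model calls
# in a time window.
#   $1 = increase of openclaw_model_errors_total over the window
#   $2 = increase of openclaw_model_requests_total over the same window
# Prints NaN when there were no requests, which is also what PromQL
# returns for 0 / 0.
error_rate_pct() {
    awk -v e="$1" -v r="$2" 'BEGIN {
        if (r == 0) { print "NaN"; exit }
        printf "%.2f\n", e / r * 100
    }'
}

# Example: 3 errors out of 120 requests in the window
#   error_rate_pct 3 120
```

The NaN case matters for dashboards: during idle periods the panel shows no data rather than 0%, which is expected and not a sign of a broken query.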

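4.3 Recommended dashboard layout

Split the dashboard into four rows:

Overview row: service status, uptime, active channel count, loaded skill count
Message statistics row: message rate, message volume grouped by channel, response latency distribution
Model calls row: API call rate, error rate, share of calls per provider
Resource monitoring row: memory usage, CPU usage (requires Node Exporter), disk space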

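5. Alert Notification Configuration

5.1 Alert rules

The rules below are written in Prometheus alerting-rule syntax. To load them in Prometheus, place them in a rule file under a `groups:` entry with a `rules:` list. Alternatively, recreate the same expressions in Grafana's alert rule editor: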

# Service-down alert
- alert: OpenClawDown
  expr: up{job="openclaw"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "OpenClaw service is unavailable"
    description: "OpenClaw has not responded to scrapes for more than 2 minutes"

# High error rate alert (more than 0.1 model errors per second)
- alert: OpenClawHighErrorRate
  expr: rate(openclaw_model_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenClaw model call error rate is too high"

# High memory alert
- alert: OpenClawHighMemory
  expr: openclaw_memory_heap_bytes > 512 * 1024 * 1024
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "OpenClaw heap usage exceeds 512 MiB"

5.2 Telegram alert notifications

# Configure a Telegram contact point in Grafana
# Alerting -> Contact points -> New
# Type: Telegram
# Bot Token: your bot token
# Chat ID: your chat ID

5.3 Webhook alert notifications

If you use WeCom (企业微信), DingTalk, or Feishu, you can send notifications through a webhook:

# WeCom webhook example
curl -s -X POST "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "msgtype": "markdown",
    "markdown": {
      "content": "## OpenClaw Alert\n> Service status abnormal\n> Time: '"$(date)"'\n> Please check promptly"
    }
  }'
    }
  }'
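
Splicing shell output into a hand-written JSON string works for `$(date)`, but it breaks as soon as the text contains a double quote, a backslash, or a newline. A small escaping helper avoids that. This is a sketch: the function names `json_escape` and `wecom_payload` are hypothetical, and only those three characters are handled (other control characters such as tabs are not):

```shell
# json_escape (hypothetical helper): escape backslashes, double quotes,
# and newlines so a string can be embedded in a JSON string literal.
# Other control characters (for example tabs) are not handled.
json_escape() {
    printf '%s' "$1" \
        | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
        | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# wecom_payload (hypothetical helper): build the markdown message body
# used in the webhook example above from an arbitrary message string.
wecom_payload() {
    printf '{"msgtype":"markdown","markdown":{"content":"%s"}}' "$(json_escape "$1")"
}

# Example invocation (not executed here):
#   curl -s -X POST "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY" \
#       -H "Content-Type: application/json" \
#       -d "$(wecom_payload "OpenClaw alert at $(date)")"
```

If `jq` is available, building the payload with it is the more robust choice, since it handles every character that JSON requires to be escaped.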

5.4 Email alert notifications

Enable SMTP in the Grafana configuration file:

# /etc/grafana/grafana.ini
[smtp]
enabled = true
host = smtp.example.com:587
user = your_smtp_username
password = your_password
from_address = your_sender_address
from_name = OpenClaw Monitor

6. Resource Usage Monitoring

6.1 Monitor system resources with Node Exporter

# Run Node Exporter
docker run -d \
  --name node-exporter \
  --restart=always \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

6.2 Key system metrics

Metric	Normal range	Alert threshold
CPU usage	< 50%	> 80% for 5 minutes
Memory usage	< 70%	> 90% for 5 minutes
Disk usage	< 80%	> 90%
Network connections	< 1000	> 5000
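
Before Node Exporter is in place, the disk threshold from the table can be checked with a few lines of shell. This is a sketch for hosts where `df -P` is available. The function names `disk_usage_pct` and `over_threshold` are hypothetical, and the parsing assumes the filesystem name contains no spaces:

```shell
# disk_usage_pct (hypothetical helper): print the used-space percentage
# (an integer without the % sign) of the filesystem containing path $1,
# taken from the Capacity column of df -P.
# Assumes the filesystem name contains no spaces.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold (hypothetical helper): succeed when integer $1 is
# strictly greater than integer $2.
over_threshold() {
    [ "$1" -gt "$2" ]
}

# Example check against the 90% disk threshold (not executed here):
#   if over_threshold "$(disk_usage_pct /)" 90; then
#       echo "Disk usage above 90% on /"
#   fi
```

The same `over_threshold` comparison can be reused for any integer metric in the table, as long as the value is collected separately.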

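6.3 OpenClaw log monitoring

Use the openclaw logs command to watch for abnormal log entries:

# Watch error logs in real time
openclaw logs | grep --line-buffered -iE "error|warn|fail"

# Optional: collect logs with Loki + Promtail
# Forward OpenClaw logs to Loki for centralized storage and querying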

7. Monitoring Stack Summary

A complete OpenClaw monitoring stack consists of the following layers:

┌─────────────────────────────────────────────┐
│          Alert notification layer           │
│     Telegram / Email / Webhook / WeCom      │
├─────────────────────────────────────────────┤
│             Visualization layer             │
│              Grafana Dashboard              │
├─────────────────────────────────────────────┤
│            Data collection layer            │
│      Prometheus + Node Exporter + Loki      │
├─────────────────────────────────────────────┤
│                Service layer                │
│          OpenClaw Gateway (:18789)          │
│          Prometheus Metrics (:9191)         │
└─────────────────────────────────────────────┘

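With the configuration above, you can monitor OpenClaw from health checks through metrics, dashboards, and alerts, so that failures are detected and handled promptly. Start with the simple cron health check, then move up to the full Prometheus + Grafana stack.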