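OpenClaw Resource Monitoring and Alert Configuration

Introduction

Understanding how OpenClaw consumes resources is key to keeping the service stable. This article covers how to monitor OpenClaw's resource usage end to end, including memory, CPU, network connections, and message throughput, and how to configure automatic alert notifications when a metric goes out of range.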

1. Built-in Resource Monitoring

1.1 The openclaw stats command

OpenClaw provides a built-in statistics command for a quick overview of resource usage:

# Show current resource usage
openclaw stats

# Sample output
# ┌─────────────────────────────────────────┐
# │       OpenClaw Runtime Statistics       │
# ├─────────────────────────────────────────┤
# │ Uptime:           3d 12h 45m            │
# │ Process PID:      12345                 │
# │ Memory (Heap):    168MB / 512MB (32%)   │
# │ Memory (RSS):     245MB                 │
# │ CPU (1m avg):     2.3%                  │
# │ Active channels:  3 / 3                 │
# │ Messages today:   342                   │
# │ Tokens today:     125,800               │
# │ Cost today:       $1.85                 │
# │ Avg response:     1.8s                  │
# │ Error rate:       0.3%                  │
# └─────────────────────────────────────────┘

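1.2 Live monitoring panel

# Start live monitoring (similar to the top command)
openclaw stats --live

# Set a custom refresh interval (in seconds)
openclaw stats --live --interval 5

The live panel continuously updates the following metrics:

Memory usage trend (ASCII chart)
Messages processed per minute
Current active connections
API call latency
Error count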

1.3 Querying historical statistics

# Show message statistics for the past 24 hours
openclaw stats --period 24h

# Show the resource trend for the past 7 days
openclaw stats --period 7d --metric memory

# Export statistics as CSV
openclaw stats --period 30d --format csv > openclaw-stats.csv

2. HTTP API Monitoring Endpoints

2.1 Fetching runtime metrics

# Basic runtime metrics
curl -s http://localhost:18789/health/stats | jq .

Response:

{
  "uptime": 302400,
  "memory": {
    "heapUsed": 168000000,
    "heapTotal": 536870912,
    "rss": 257000000,
    "external": 15000000
  },
  "cpu": {
    "user": 125000,
    "system": 45000,
    "percent": 2.3
  },
  "messages": {
    "today": 342,
    "thisHour": 28,
    "total": 15680
  },
  "tokens": {
    "today": {
      "input": 89500,
      "output": 36300
    }
  },
  "responseTime": {
    "avg": 1800,
    "p50": 1500,
    "p95": 3200,
    "p99": 5100
  },
  "errors": {
    "today": 3,
    "rate": 0.003
  }
}
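
The payload above is easier to act on once it is reduced to a few headline numbers. The following is a minimal sketch, not part of OpenClaw: `summarize_stats` is a hypothetical helper, and it assumes that `responseTime` values are milliseconds and that `errors.rate` is a fraction between 0 and 1. Both assumptions are inferred from comparing this payload with the CLI output in 1.1, not from documented behavior.

```python
from typing import Any


def summarize_stats(stats: dict[str, Any]) -> dict[str, float]:
    """Derive headline numbers from a /health/stats payload (hypothetical helper)."""
    mem = stats["memory"]
    tokens = stats["tokens"]["today"]
    return {
        # Heap in use as a percentage of the total heap size
        "heap_percent": 100.0 * mem["heapUsed"] / mem["heapTotal"],
        # Resident set size converted from bytes to MiB
        "rss_mib": mem["rss"] / (1024 * 1024),
        # Input plus output tokens for the day
        "tokens_today": tokens["input"] + tokens["output"],
        # Assumption: errors.rate is a 0-1 fraction
        "error_percent": 100.0 * stats["errors"]["rate"],
        # Assumption: responseTime values are milliseconds
        "p95_seconds": stats["responseTime"]["p95"] / 1000.0,
    }


# Trimmed copy of the sample response shown above
sample = {
    "memory": {"heapUsed": 168000000, "heapTotal": 536870912, "rss": 257000000},
    "tokens": {"today": {"input": 89500, "output": 36300}},
    "responseTime": {"avg": 1800, "p50": 1500, "p95": 3200, "p99": 5100},
    "errors": {"today": 3, "rate": 0.003},
}
summary = summarize_stats(sample)
```

In practice the dictionary would come from the HTTP response body instead of the inline sample.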

2.2 Per-channel statistics

# Fetch message statistics for each channel
curl -s http://localhost:18789/health/channels | jq .

{
  "channels": [
    {
      "name": "whatsapp",
      "status": "connected",
      "uptime": 302400,
      "messagesReceived": 180,
      "messagesSent": 175,
      "avgResponseTime": 1650,
      "errors": 1
    },
    {
      "name": "telegram",
      "status": "connected",
      "uptime": 302400,
      "messagesReceived": 120,
      "messagesSent": 118,
      "avgResponseTime": 1950,
      "errors": 2
    }
  ]
}
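
A per-channel response like the one above can be turned into a short health report. This sketch is illustrative only: `channel_report` is a hypothetical function, and treating "received minus sent" as a reply gap is an assumption about what those counters mean.

```python
def channel_report(payload):
    """Summarize each channel from a /health/channels payload (hypothetical helper)."""
    rows = []
    for ch in payload["channels"]:
        received = ch["messagesReceived"]
        rows.append({
            "name": ch["name"],
            "connected": ch["status"] == "connected",
            # Assumption: a received message without a matching sent message is a gap
            "reply_gap": received - ch["messagesSent"],
            # Errors per received message; 0.0 when nothing has been received yet
            "error_rate": ch["errors"] / received if received else 0.0,
        })
    return rows


# Trimmed copy of the sample response shown above
sample = {
    "channels": [
        {"name": "whatsapp", "status": "connected",
         "messagesReceived": 180, "messagesSent": 175, "errors": 1},
        {"name": "telegram", "status": "connected",
         "messagesReceived": 120, "messagesSent": 118, "errors": 2},
    ]
}
report = channel_report(sample)
disconnected = [row["name"] for row in report if not row["connected"]]
```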

3. Prometheus Metrics Collection

3.1 Enabling the Prometheus endpoint

// ~/.config/openclaw/openclaw.json5
{
  "monitoring": {
    "prometheus": {
      "enabled": true,
      "port": 9191,
      "path": "/metrics"
    }
  }
}

3.2 Key Prometheus metrics

OpenClaw exports the following metrics:

Message processing metrics:

Metric	Type	Description
`openclaw_messages_received_total`	Counter	Total messages received (labeled by channel)
`openclaw_messages_sent_total`	Counter	Total messages sent
`openclaw_messages_failed_total`	Counter	Messages that failed processing

Model call metrics:

Metric	Type	Description
`openclaw_model_requests_total`	Counter	Total model API calls
`openclaw_model_errors_total`	Counter	Model API errors
`openclaw_model_duration_seconds`	Histogram	Distribution of model response times
`openclaw_model_tokens_total`	Counter	Total tokens used (input/output labels)

Resource metrics:

Metric	Type	Description
`openclaw_memory_heap_bytes`	Gauge	Heap memory in use
`openclaw_memory_rss_bytes`	Gauge	Resident set size
`openclaw_active_connections`	Gauge	Active connections
`openclaw_queue_length`	Gauge	Request queue length
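
To inspect these metrics without a Prometheus server, the text exposition format served at the metrics path can be read directly. The sketch below is a deliberately small parser, not a full implementation of the format: it skips comment lines, ignores optional timestamps, and does not unescape label values. The sample text and its values are made up for illustration; only the metric names and the `channel` label come from this article.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# one label pair such as channel="whatsapp"
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples from Prometheus text exposition output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative scrape output (values are invented)
SAMPLE = """\
# TYPE openclaw_messages_received_total counter
openclaw_messages_received_total{channel="whatsapp"} 180
openclaw_messages_received_total{channel="telegram"} 120
# TYPE openclaw_queue_length gauge
openclaw_queue_length 4
"""
samples = parse_metrics(SAMPLE)
```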

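3.3 Useful PromQL queries

# Messages received per minute (5-minute average)
rate(openclaw_messages_received_total[5m]) * 60

# Message volume per channel over the last 24 hours
sum by (channel) (increase(openclaw_messages_received_total[24h]))

# P95 model-call latency (buckets aggregated by le across all label sets)
histogram_quantile(0.95, sum by (le) (rate(openclaw_model_duration_seconds_bucket[5m])))

# Model API error ratio (0-1)
sum(rate(openclaw_model_errors_total[5m])) / sum(rate(openclaw_model_requests_total[5m]))

# Heap usage percentage
# (openclaw_memory_heap_max_bytes is not in the tables in 3.2; check that /metrics exposes it)
openclaw_memory_heap_bytes / openclaw_memory_heap_max_bytes * 100

# Token consumption per hour
rate(openclaw_model_tokens_total[1h]) * 3600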
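
Several of the queries above multiply `rate()` by 60 or 3600, which works because `rate()` returns a per-second value. The sketch below shows the core idea on raw counter samples. It is a simplification: Prometheus also extrapolates to the edges of the time window, which this version leaves out, and the sample data is invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter from (timestamp_seconds, value) pairs.

    Simplified: handles counter resets but does no window extrapolation.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count the new value as-is
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Invented samples, one per minute, with a process restart before t=180
counter = [(0, 100), (60, 130), (120, 160), (180, 10), (240, 40)]
per_second = simple_rate(counter)
per_minute = per_second * 60
```

The reset handling is the reason a counter plus `rate()` is preferred over exporting a pre-computed "messages per minute" gauge: a restart does not produce a negative spike.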

4. Alert Rule Configuration

4.1 Threshold-based alerts

Define alert rules in the OpenClaw configuration:

{
  "alerts": {
    "enabled": true,
    "rules": [
      {
        "name": "High memory usage",
        "condition": "memory.heapPercent > 85",
        "duration": "10m",
        "severity": "warning",
        "message": "Memory usage above 85%, current: {value}%"
      },
      {
        "name": "Slow responses",
        "condition": "responseTime.p95 > 5000",
        "duration": "5m",
        "severity": "warning",
        "message": "P95 response time above 5 seconds, current: {value}ms"
      },
      {
        "name": "High error rate",
        "condition": "errors.rate > 0.05",
        "duration": "5m",
        "severity": "critical",
        "message": "Error rate above 5%, current: {value}"
      },
      {
        "name": "Channel disconnected",
        "condition": "channels.disconnected > 0",
        "duration": "3m",
        "severity": "critical",
        "message": "{value} channel(s) disconnected"
      },
      {
        "name": "Queue backlog",
        "condition": "queue.length > 30",
        "duration": "2m",
        "severity": "warning",
        "message": "Request queue backlog: {value} items"
      }
    ]
  }
}
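
Each rule pairs a threshold with a `duration`. The sketch below shows one plausible reading of that pair: the alert fires only after the value has stayed above the threshold for the whole duration without interruption. This reading is an assumption, since the article does not spell out the semantics, and `ThresholdRule` and `parse_duration` are hypothetical names rather than OpenClaw internals.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert strings such as '10m' or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


class ThresholdRule:
    """Fires once a value has exceeded the threshold for the full duration."""

    def __init__(self, name, threshold, duration):
        self.name = name
        self.threshold = threshold
        self.hold_seconds = parse_duration(duration)
        self._breach_started = None  # time the current breach began, if any

    def observe(self, now, value):
        """Record one measurement taken at `now` (seconds); return True if firing."""
        if value <= self.threshold:
            self._breach_started = None  # any recovery resets the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


rule = ThresholdRule("High memory usage", threshold=85, duration="10m")
history = [rule.observe(t, v) for t, v in [(0, 90), (300, 91), (600, 92), (660, 80), (700, 90)]]
```

Under this reading a short spike never alerts, which is the usual reason for pairing thresholds with a hold time.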

4.2 Alert notification channels

{
  "alerts": {
    "notifications": [
      {
        "type": "telegram",
        "botToken": "YOUR_BOT_TOKEN",
        "chatId": "YOUR_CHAT_ID",
        // Receive critical alerts only
        "minSeverity": "critical"
      },
      {
        "type": "webhook",
        "url": "https://hooks.slack.com/services/xxx",
        "minSeverity": "warning"
      },
      {
        "type": "email",
        "to": "YOUR_EMAIL_ADDRESS",
        "minSeverity": "critical"
      }
    ],
    // Alert throttling: minimum interval between repeats of the same alert
    "throttle": "15m",
    // Send a notification when the alert resolves
    "notifyOnResolve": true
  }
}
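
The interaction between `minSeverity` and `throttle` can be sketched as a small routing function. This is an illustration under assumptions, not OpenClaw's implementation: `Notifier` is a hypothetical class, the severity ordering covers only the two levels used in this article, and throttling is assumed to apply per alert name.

```python
# Only the two severities that appear in this article; the ordering is an assumption
SEVERITY_ORDER = {"warning": 1, "critical": 2}


class Notifier:
    """Decides which channels receive an alert, with per-alert throttling."""

    def __init__(self, channels, throttle_seconds):
        self.channels = channels
        self.throttle_seconds = throttle_seconds
        self._last_sent = {}  # alert name -> time of the last delivery

    def route(self, now, alert_name, severity):
        """Return the channel types that should receive this alert at time `now`."""
        last = self._last_sent.get(alert_name)
        if last is not None and now - last < self.throttle_seconds:
            return []  # same alert sent too recently
        targets = [
            channel["type"]
            for channel in self.channels
            if SEVERITY_ORDER[severity] >= SEVERITY_ORDER[channel["minSeverity"]]
        ]
        if targets:
            self._last_sent[alert_name] = now
        return targets


notifier = Notifier(
    channels=[
        {"type": "telegram", "minSeverity": "critical"},
        {"type": "webhook", "minSeverity": "warning"},
        {"type": "email", "minSeverity": "critical"},
    ],
    throttle_seconds=15 * 60,
)
first = notifier.route(0, "High error rate", "critical")
repeat = notifier.route(60, "High error rate", "critical")
later = notifier.route(900, "High error rate", "critical")
backlog = notifier.route(0, "Queue backlog", "warning")
```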

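4.3 Grafana alert rules

If you use Grafana, you can configure more flexible alerts. The rules below use the Prometheus alerting-rule file format, so they are evaluated by Prometheus or by a Prometheus-compatible ruler that Grafana manages:

# Alerting rules (Prometheus rule file format)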
groups:
  - name: openclaw
    rules:
      - alert: OpenClawHighMemory
        expr: openclaw_memory_heap_bytes > 400 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OpenClaw memory usage is high"
          description: "Heap memory: {{ $value | humanize1024 }}B"

      - alert: OpenClawMessageBacklog
        expr: openclaw_queue_length > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OpenClaw message queue backlog"

      - alert: OpenClawDown
        expr: up{job="openclaw"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw service is unavailable"

5. Message Volume and Cost Statistics

5.1 Daily report

# Show today's summary
openclaw stats --period today --summary

# Output
# Today's statistics (2026-03-14)
# ──────────────────────
# Messages:        342
# Tokens used:     125,800 (input: 89,500 / output: 36,300)
# Estimated cost:  $1.85
# Avg response:    1.8s
# Slowest:         6.2s
# Errors:          3 (0.9%)
# Active users:    28

5.2 Cost forecast

# Show this month's cost trend and forecast
openclaw stats --cost --period month

# Output
# This month's costs
# ──────────────────────
# Spent so far:       $42.50
# Daily average:      $3.04
# Month-end forecast: $94.20
# Top channel:        WhatsApp ($22.30)
# Top user:           user_abc ($8.50)
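
A month-end forecast of this kind can be approximated with a straight-line projection: divide the spend so far by the days elapsed, then multiply by the days in the month. The sketch below shows that naive method only. How `openclaw stats` actually computes its forecast is not documented here, so the projected figure need not match the sample output exactly. `forecast_month_end` is a hypothetical helper.

```python
import calendar
from datetime import date


def forecast_month_end(spent, today):
    """Naive linear projection: (daily average, projected month total)."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = spent / today.day  # counts today as a full day
    return daily_average, daily_average * days_in_month


# Figures taken from the sample output: $42.50 spent as of 2026-03-14
daily, projected = forecast_month_end(42.50, date(2026, 3, 14))
```

A linear projection reacts slowly to a sudden change in usage, so it is best read alongside the daily trend rather than on its own.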

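6. Choosing a Monitoring Architecture

Pick a monitoring setup that fits your deployment size:

Individuals / small teams (1-5 users):

Check manually with the openclaw stats command
Use cron plus a watchdog script for basic health checks
Send alerts through Telegram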
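
For the small-team setup, a watchdog can be as simple as a script that polls the stats endpoint and returns a status code for cron to act on. The sketch below is one possible shape, not a script shipped with OpenClaw. It assumes the `/health/stats` endpoint and payload fields shown in section 2.1, reuses the 85% heap and 5% error-rate thresholds from section 4.1, and the exit-code convention (0 ok, 1 degraded, 2 unreachable) is invented for this example.

```python
import json
import urllib.request

STATS_URL = "http://localhost:18789/health/stats"  # endpoint from section 2.1


def fetch_stats(url=STATS_URL, timeout=5.0):
    """Fetch and decode the stats payload; raises OSError or ValueError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def check_health(fetch=fetch_stats, max_error_rate=0.05, max_heap_percent=85.0):
    """Return (exit_code, message): 0 = ok, 1 = degraded, 2 = unreachable."""
    try:
        stats = fetch()
    except (OSError, ValueError) as exc:
        # Connection failures and invalid JSON both count as unreachable
        return 2, f"unreachable: {exc}"
    mem = stats["memory"]
    heap_percent = 100.0 * mem["heapUsed"] / mem["heapTotal"]
    problems = []
    if heap_percent > max_heap_percent:
        problems.append(f"heap at {heap_percent:.0f}%")
    if stats["errors"]["rate"] > max_error_rate:
        problems.append(f"error rate {stats['errors']['rate']:.1%}")
    if problems:
        return 1, "; ".join(problems)
    return 0, "ok"
```

A cron wrapper would call `check_health()`, print the message, and exit with the returned code so that a non-zero result can trigger a Telegram message or a restart. Passing `fetch` as a parameter keeps the check testable without a running service.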

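Medium deployments (5-50 users):

Enable Prometheus metrics collection
Deploy Grafana dashboards
Configure multi-level alert rules
Review cost reports regularly

Large / enterprise deployments:

A full Prometheus + Grafana + Alertmanager stack
Integration with an enterprise monitoring platform (Datadog/New Relic)
Centralized log collection (ELK/Loki)
SLA monitoring and automated operations

Choose a monitoring setup that matches your scale: avoid over-engineering, but do not leave blind spots on the metrics that matter. Continuous monitoring is the foundation for keeping OpenClaw running reliably.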