Introduction
As a long-running AI assistant service, stability is paramount for OpenClaw. If the service goes down, all connected chat channels will stop responding. This article will walk you through building a complete monitoring and alerting system from scratch, ensuring issues are detected and addressed as quickly as possible.
1. Health Check Endpoints
OpenClaw Gateway runs on port 18789 by default and provides built-in health check endpoints.
1.1 Basic Health Check
# Check if the service is alive
curl -s http://localhost:18789/health
# Expected response
# {"status":"ok","uptime":3600,"version":"1.2.0"}
1.2 Detailed Status Check
# Get detailed status information, including channel connection states
curl -s http://localhost:18789/health/detail | jq .
Example response:
{
  "status": "ok",
  "uptime": 86400,
  "version": "1.2.0",
  "channels": {
    "whatsapp": "connected",
    "telegram": "connected",
    "discord": "connected"
  },
  "model": {
    "provider": "claude",
    "status": "available"
  },
  "memory": {
    "heapUsed": "128MB",
    "heapTotal": "256MB"
  }
}
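Scripting against /health/detail is straightforward. As a sketch (assuming jq is installed and the response has the shape shown above), this helper prints any channel that is not reporting "connected":

```shell
# Sketch: list channels from /health/detail that are not "connected".
# Assumes jq is installed and the flat channel map shown above.
down_channels() {
    # $1 = JSON body from /health/detail
    echo "$1" | jq -r '.channels | to_entries[] | select(.value != "connected") | .key'
}

# Usage against a live gateway:
#   down_channels "$(curl -sf http://localhost:18789/health/detail)"
```

An empty result means every channel is up; a non-empty result can feed directly into an alerting script.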
2. Uptime Monitoring Solutions
2.1 Using Cron for Periodic Checks
The simplest monitoring approach is to periodically check the service status via crontab:
# Edit crontab
crontab -e
# Check every 5 minutes; send an email notification on failure
# (the mail command requires a working local MTA, e.g. postfix)
*/5 * * * * curl -sf http://localhost:18789/health > /dev/null || echo "OpenClaw service is down! Time: $(date)" | mail -s "OpenClaw Alert" [email protected]
If you want to automatically restart the service when a failure is detected:
#!/bin/bash
# Save as /usr/local/bin/openclaw-watchdog.sh
HEALTH_URL="http://localhost:18789/health"
LOG_FILE="/var/log/openclaw-watchdog.log"

if ! curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
    echo "[$(date)] Health check failed, restarting OpenClaw..." >> "$LOG_FILE"
    openclaw restart
    sleep 15
    if curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
        echo "[$(date)] Restart successful, service recovered" >> "$LOG_FILE"
    else
        echo "[$(date)] Service still unavailable after restart, manual intervention required!" >> "$LOG_FILE"
        # Send urgent notification
        curl -s -X POST "https://api.telegram.org/bot<TOKEN>/sendMessage" \
            -d chat_id="<CHAT_ID>" \
            -d text="🚨 OpenClaw service restart failed, manual intervention needed!"
    fi
fi
# Make the script executable, then add it to crontab to run every 3 minutes
chmod +x /usr/local/bin/openclaw-watchdog.sh
*/3 * * * * /usr/local/bin/openclaw-watchdog.sh
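A single failed probe can be a transient blip (a GC pause, a brief network hiccup), so restarting on the first failure may cause needless churn. One way to debounce is to require several consecutive failures before acting. A sketch, keeping a failure counter in a state file (FAIL_FILE and THRESHOLD are illustrative names, not OpenClaw settings):

```shell
#!/bin/bash
# Sketch: debounce the watchdog by requiring N consecutive failures.
# FAIL_FILE and THRESHOLD are illustrative, not OpenClaw settings.
FAIL_FILE="${FAIL_FILE:-/tmp/openclaw-failcount}"
THRESHOLD="${THRESHOLD:-3}"

record_result() {
    # $1 = "ok" or "fail"; prints "alert" once THRESHOLD consecutive failures occur
    if [ "$1" = "ok" ]; then
        echo 0 > "$FAIL_FILE"
        return 0
    fi
    count=$(( $(cat "$FAIL_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$FAIL_FILE"
    if [ "$count" -ge "$THRESHOLD" ]; then
        echo "alert"
    fi
    return 0
}

# In the watchdog above, the direct restart could become:
#   if curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
#       record_result ok
#   elif [ "$(record_result fail)" = "alert" ]; then
#       openclaw restart
#   fi
```

With a 3-minute cron interval and THRESHOLD=3, a restart happens only after roughly nine minutes of sustained failure.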
2.2 Using Uptime Kuma for Monitoring
Uptime Kuma is an open-source monitoring tool with an attractive web interface.
# Deploy Uptime Kuma with Docker
docker run -d \
  --name uptime-kuma \
  --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  louislam/uptime-kuma:latest
Add a monitor in Uptime Kuma:
| Setting | Value |
|---|---|
| Monitor Type | HTTP(s) |
| Name | OpenClaw Gateway |
| URL | http://your-server-ip:18789/health |
| Heartbeat Interval | 60 seconds |
| Retries | 3 |
| Timeout | 10 seconds |
| Expected Status Code | 200 |
3. Prometheus Metrics Collection
3.1 Enabling the Prometheus Endpoint
Enable Prometheus metrics export in the OpenClaw configuration file:
// ~/.config/openclaw/openclaw.json5
{
  "monitoring": {
    "prometheus": {
      "enabled": true,
      "port": 9191,
      "path": "/metrics"
    }
  }
}
Restart the service for the configuration to take effect:
openclaw restart
3.2 Viewing Available Metrics
curl -s http://localhost:9191/metrics
Key metrics exported by OpenClaw:
| Metric Name | Type | Description |
|---|---|---|
| openclaw_messages_total | Counter | Total messages processed |
| openclaw_response_duration_seconds | Histogram | Response latency distribution |
| openclaw_active_channels | Gauge | Number of active channels |
| openclaw_model_requests_total | Counter | Model API call count |
| openclaw_model_errors_total | Counter | Model API error count |
| openclaw_memory_heap_bytes | Gauge | Node.js heap memory usage |
| openclaw_skills_loaded | Gauge | Number of loaded skills |
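To spot-check a single value without a full Prometheus setup, you can parse the exposition-format output directly. A small awk-based sketch (metric names taken from the table above):

```shell
# Sketch: pull one sample value out of Prometheus exposition-format text
# (as served by /metrics). Matches the bare metric name exactly, so
# labeled series like foo{bar="x"} would need a looser pattern.
metric_value() {
    # $1 = exposition text, $2 = metric name
    echo "$1" | awk -v m="$2" '$1 == m { print $2; exit }'
}

# Usage against the live endpoint:
#   metric_value "$(curl -s http://localhost:9191/metrics)" openclaw_active_channels
```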
3.3 Configuring Prometheus Scraping
# prometheus.yml
scrape_configs:
  - job_name: 'openclaw'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:9191']
4. Grafana Visualization Dashboard
4.1 Installing Grafana
# Install via Docker
docker run -d \
  --name grafana \
  --restart=always \
  -p 3000:3000 \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss:latest
4.2 Creating an OpenClaw Dashboard
After adding Prometheus as a data source in Grafana, create the following key panels:
Message Processing Rate Panel (PromQL):
rate(openclaw_messages_total[5m])
P95 Response Latency Panel (the 0.95 quantile, not the average):
histogram_quantile(0.95, rate(openclaw_response_duration_seconds_bucket[5m]))
Model Call Error Rate Panel:
rate(openclaw_model_errors_total[5m]) / rate(openclaw_model_requests_total[5m]) * 100
Memory Usage Panel:
openclaw_memory_heap_bytes / 1024 / 1024
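The error-rate expression is just errors divided by requests, scaled to a percentage. For cross-checking what a panel should display against raw counter deltas, a tiny helper (sample numbers in the comment are illustrative):

```shell
# Sketch: the same math as the error-rate panel, for manual cross-checks.
error_rate_pct() {
    # $1 = errors in the window, $2 = total requests in the window
    awk -v e="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (t == 0 ? 0 : e / t * 100) }'
}

# e.g. error_rate_pct 5 200 prints 2.5
```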
4.3 Recommended Dashboard Layout
It is recommended to divide the dashboard into four sections:
- Overview Row: Service status, uptime, active channel count, loaded skills count
- Message Statistics Row: Message rate, message volume by channel, response latency distribution
- Model Calls Row: API call rate, error rate, call distribution by provider
- Resource Monitoring Row: Memory usage, CPU usage (requires Node Exporter), disk space
5. Alert Notification Configuration
5.1 Grafana Alert Rules
Define alert rules (shown below in Prometheus rule format, which Prometheus loads directly and Grafana's unified alerting can also evaluate against a Prometheus data source):
# Service unavailable alert
- alert: OpenClawDown
  expr: up{job="openclaw"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "OpenClaw service unavailable"
    description: "OpenClaw service has stopped responding for over 2 minutes"

# High error rate alert (fires above 0.1 model errors per second)
- alert: OpenClawHighErrorRate
  expr: rate(openclaw_model_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenClaw model call error rate is too high"

# Memory alert
- alert: OpenClawHighMemory
  expr: openclaw_memory_heap_bytes > 512 * 1024 * 1024
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "OpenClaw memory usage exceeds 512MB"
5.2 Telegram Alert Notifications
# Configure Telegram contact point in Grafana
# Settings -> Contact Points -> New
# Type: Telegram
# Bot Token: Your Bot Token
# Chat ID: Your Chat ID
5.3 Webhook Alert Notifications
If you use WeChat Work, DingTalk, or Feishu, you can configure Webhook notifications:
# WeChat Work Webhook example
curl -s -X POST "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "msgtype": "markdown",
    "markdown": {
      "content": "## OpenClaw Alert\n> Service status abnormal\n> Time: '"$(date)"'\n> Please investigate promptly"
    }
  }'
5.4 Email Alert Notifications
Enable email in the Grafana configuration file:
# /etc/grafana/grafana.ini
[smtp]
enabled = true
host = smtp.example.com:587
user = [email protected]
password = your_password
from_address = [email protected]
from_name = OpenClaw Monitor
6. Resource Usage Monitoring
6.1 Monitoring System Resources with Node Exporter
# Install Node Exporter
docker run -d \
  --name node-exporter \
  --restart=always \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
6.2 Key System Metrics
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| CPU Usage | < 50% | > 80% for 5 minutes |
| Memory Usage | < 70% | > 90% for 5 minutes |
| Disk Usage | < 80% | > 90% |
| Network Connections | < 1000 | > 5000 |
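With Node Exporter scraped, the thresholds in the table map onto standard node_* metrics. These PromQL expressions (standard Node Exporter metric names; the mountpoint label may need adjusting for your layout) can back both panels and alert rules:

```promql
# CPU usage (%)
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%) for the root filesystem
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```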
6.3 OpenClaw Log Monitoring
Monitor error logs using the openclaw logs command:
# Monitor error logs in real time
openclaw logs | grep --line-buffered -iE "error|warn|fail"
# Use Loki + Promtail for log collection (optional)
# Forward OpenClaw logs to Loki for centralized management and querying
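If you go the Loki route, a minimal Promtail scrape config might look like the sketch below. The log path is an assumption; point __path__ at wherever your OpenClaw install actually writes its logs:

```yaml
# Sketch: Promtail scrape config for OpenClaw logs (log path is an assumption)
scrape_configs:
  - job_name: openclaw
    static_configs:
      - targets: [localhost]
        labels:
          job: openclaw
          __path__: /var/log/openclaw/*.log
```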
7. Monitoring System Summary
A complete OpenClaw monitoring system should include the following layers:
┌─────────────────────────────────────────────┐
│ Alert Notification Layer │
│ Telegram / Email / Webhook / WeChat Work │
├─────────────────────────────────────────────┤
│ Visualization Layer │
│ Grafana Dashboard │
├─────────────────────────────────────────────┤
│ Data Collection Layer │
│ Prometheus + Node Exporter + Loki │
├─────────────────────────────────────────────┤
│ Service Layer │
│ OpenClaw Gateway (:18789) │
│ Prometheus Metrics (:9191) │
└─────────────────────────────────────────────┘
With the configuration above, you can achieve comprehensive monitoring of your OpenClaw service, ensuring that any anomaly is detected and addressed promptly. We recommend starting with the simplest cron health check and gradually upgrading to a full Prometheus + Grafana solution.