Introduction
As a long-running AI assistant service, stability is paramount for OpenClaw. If the service goes down, all connected chat channels will stop responding. This article will walk you through building a complete monitoring and alerting system from scratch, ensuring issues are detected and addressed as quickly as possible.
1. Health Check Endpoints
OpenClaw Gateway runs on port 18789 by default and provides built-in health check endpoints.
1.1 Basic Health Check
# Check if the service is alive
curl -s http://localhost:18789/health
# Expected response
# {"status":"ok","uptime":3600,"version":"1.2.0"}
1.2 Detailed Status Check
# Get detailed status information, including channel connection states
curl -s http://localhost:18789/health/detail | jq .
Example response:
{
  "status": "ok",
  "uptime": 86400,
  "version": "1.2.0",
  "channels": {
    "whatsapp": "connected",
    "telegram": "connected",
    "discord": "connected"
  },
  "model": {
    "provider": "claude",
    "status": "available"
  },
  "memory": {
    "heapUsed": "128MB",
    "heapTotal": "256MB"
  }
}
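Scripting against /health/detail is straightforward. As a sketch (assuming jq is installed and the response has the shape shown above), this helper prints any channel that is not reporting "connected":

```shell
# Sketch: list channels from /health/detail that are not "connected".
# Assumes jq is installed and the flat channel map shown above.
down_channels() {
    # $1 = JSON body from /health/detail
    echo "$1" | jq -r '.channels | to_entries[] | select(.value != "connected") | .key'
}

# Usage against a live gateway:
#   down_channels "$(curl -sf http://localhost:18789/health/detail)"
```

An empty result means every channel is up; a non-empty result can feed directly into an alerting script.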
2. Uptime Monitoring Solutions
2.1 Using Cron for Periodic Checks
The simplest monitoring approach is to periodically check the service status via crontab:
# Edit crontab
crontab -e
# Check every 5 minutes; send an email notification on failure
# (the mail command requires a working local MTA, e.g. postfix)
*/5 * * * * curl -sf http://localhost:18789/health > /dev/null || echo "OpenClaw service is down! Time: $(date)" | mail -s "OpenClaw Alert" [email protected]
If you want to automatically restart the service when a failure is detected:
#!/bin/bash
# Save as /usr/local/bin/openclaw-watchdog.sh
HEALTH_URL="http://localhost:18789/health"
LOG_FILE="/var/log/openclaw-watchdog.log"

if ! curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
    echo "[$(date)] Health check failed, restarting OpenClaw..." >> "$LOG_FILE"
    openclaw restart
    sleep 15
    if curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
        echo "[$(date)] Restart successful, service recovered" >> "$LOG_FILE"
    else
        echo "[$(date)] Service still unavailable after restart, manual intervention required!" >> "$LOG_FILE"
        # Send urgent notification
        curl -s -X POST "https://api.telegram.org/bot<TOKEN>/sendMessage" \
            -d chat_id="<CHAT_ID>" \
            -d text="🚨 OpenClaw service restart failed, manual intervention needed!"
    fi
fi
# Make the script executable, then add it to crontab to run every 3 minutes
chmod +x /usr/local/bin/openclaw-watchdog.sh
*/3 * * * * /usr/local/bin/openclaw-watchdog.sh
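A single failed probe can be a transient blip (a GC pause, a brief network hiccup), so restarting on the first failure may cause needless churn. One way to debounce is to require several consecutive failures before acting. A sketch, keeping a failure counter in a state file (FAIL_FILE and THRESHOLD are illustrative names, not OpenClaw settings):

```shell
#!/bin/bash
# Sketch: debounce the watchdog by requiring N consecutive failures.
# FAIL_FILE and THRESHOLD are illustrative, not OpenClaw settings.
FAIL_FILE="${FAIL_FILE:-/tmp/openclaw-failcount}"
THRESHOLD="${THRESHOLD:-3}"

record_result() {
    # $1 = "ok" or "fail"; prints "alert" once THRESHOLD consecutive failures occur
    if [ "$1" = "ok" ]; then
        echo 0 > "$FAIL_FILE"
        return 0
    fi
    count=$(( $(cat "$FAIL_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$FAIL_FILE"
    if [ "$count" -ge "$THRESHOLD" ]; then
        echo "alert"
    fi
    return 0
}

# In the watchdog above, the direct restart could become:
#   if curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
#       record_result ok
#   elif [ "$(record_result fail)" = "alert" ]; then
#       openclaw restart
#   fi
```

With a 3-minute cron interval and THRESHOLD=3, a restart happens only after roughly nine minutes of sustained failure.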
2.2 Using Uptime Kuma for Monitoring
Uptime Kuma is an open-source monitoring tool with an attractive web interface.
# Deploy Uptime Kuma with Docker
docker run -d \
  --name uptime-kuma \
  --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  louislam/uptime-kuma:latest
Add a monitor in Uptime Kuma:
| Setting | Value |
|---|---|
| Monitor Type | HTTP(s) |
| Name | OpenClaw Gateway |
| URL | http://your-server-ip:18789/health |
| Heartbeat Interval | 60 seconds |
| Retries | 3 |
| Timeout | 10 seconds |
| Expected Status Code | 200 |
3. Prometheus Metrics Collection
3.1 Enabling the Prometheus Endpoint
Enable Prometheus metrics export in the OpenClaw configuration file:
// ~/.config/openclaw/openclaw.json5
{
  "monitoring": {
    "prometheus": {
      "enabled": true,
      "port": 9191,
      "path": "/metrics"
    }
  }
}
Restart the service for the configuration to take effect:
openclaw restart
3.2 Viewing Available Metrics
curl -s http://localhost:9191/metrics
Key metrics exported by OpenClaw:
| Metric Name | Type | Description |
|---|---|---|
| openclaw_messages_total | Counter | Total messages processed |
| openclaw_response_duration_seconds | Histogram | Response latency distribution |
| openclaw_active_channels | Gauge | Number of active channels |
| openclaw_model_requests_total | Counter | Model API call count |
| openclaw_model_errors_total | Counter | Model API error count |
| openclaw_memory_heap_bytes | Gauge | Node.js heap memory usage |
| openclaw_skills_loaded | Gauge | Number of loaded skills |
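To spot-check a single value without a full Prometheus setup, you can parse the exposition-format output directly. A small awk-based sketch (metric names taken from the table above):

```shell
# Sketch: pull one sample value out of Prometheus exposition-format text
# (as served by /metrics). Matches the bare metric name exactly, so
# labeled series like foo{bar="x"} would need a looser pattern.
metric_value() {
    # $1 = exposition text, $2 = metric name
    echo "$1" | awk -v m="$2" '$1 == m { print $2; exit }'
}

# Usage against the live endpoint:
#   metric_value "$(curl -s http://localhost:9191/metrics)" openclaw_active_channels
```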
3.3 Configuring Prometheus Scraping
# prometheus.yml
scrape_configs:
  - job_name: 'openclaw'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:9191']
4. Grafana Visualization Dashboard
4.1 Installing Grafana
# Install via Docker
docker run -d \
  --name grafana \
  --restart=always \
  -p 3000:3000 \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss:latest
4.2 Creating an OpenClaw Dashboard
After adding Prometheus as a data source in Grafana, create the following key panels:
Message Processing Rate Panel (PromQL):
rate(openclaw_messages_total[5m])
P95 Response Latency Panel (the 0.95 quantile, not the average):
histogram_quantile(0.95, rate(openclaw_response_duration_seconds_bucket[5m]))
Model Call Error Rate Panel:
rate(openclaw_model_errors_total[5m]) / rate(openclaw_model_requests_total[5m]) * 100
Memory Usage Panel:
openclaw_memory_heap_bytes / 1024 / 1024
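The error-rate expression is just errors divided by requests, scaled to a percentage. For cross-checking what a panel should display against raw counter deltas, a tiny helper (sample numbers in the comment are illustrative):

```shell
# Sketch: the same math as the error-rate panel, for manual cross-checks.
error_rate_pct() {
    # $1 = errors in the window, $2 = total requests in the window
    awk -v e="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (t == 0 ? 0 : e / t * 100) }'
}

# e.g. error_rate_pct 5 200 prints 2.5
```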
4.3 Recommended Dashboard Layout
It is recommended to divide the dashboard into four sections:
- Overview Row: Service status, uptime, active channel count, loaded skills count
- Message Statistics Row: Message rate, message volume by channel, response latency distribution
- Model Calls Row: API call rate, error rate, call distribution by provider
- Resource Monitoring Row: Memory usage, CPU usage (requires Node Exporter), disk space
5. Alert Notification Configuration
5.1 Grafana Alert Rules
Define alert rules (shown below in Prometheus rule format, which Prometheus loads directly and Grafana's unified alerting can also evaluate against a Prometheus data source):
# Service unavailable alert
- alert: OpenClawDown
  expr: up{job="openclaw"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "OpenClaw service unavailable"
    description: "OpenClaw service has stopped responding for over 2 minutes"

# High error rate alert (fires above 0.1 model errors per second)
- alert: OpenClawHighErrorRate
  expr: rate(openclaw_model_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenClaw model call error rate is too high"

# Memory alert
- alert: OpenClawHighMemory
  expr: openclaw_memory_heap_bytes > 512 * 1024 * 1024
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "OpenClaw memory usage exceeds 512MB"
5.2 Telegram Alert Notifications
# Configure Telegram contact point in Grafana
# Settings -> Contact Points -> New
# Type: Telegram
# Bot Token: Your Bot Token
# Chat ID: Your Chat ID
5.3 Webhook Alert Notifications
If you use WeChat Work, DingTalk, or Feishu, you can configure Webhook notifications:
# WeChat Work Webhook example
curl -s -X POST "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "msgtype": "markdown",
    "markdown": {
      "content": "## OpenClaw Alert\n> Service status abnormal\n> Time: '"$(date)"'\n> Please investigate promptly"
    }
  }'
5.4 Email Alert Notifications
Enable email in the Grafana configuration file:
# /etc/grafana/grafana.ini
[smtp]
enabled = true
host = smtp.example.com:587
user = [email protected]
password = your_password
from_address = [email protected]
from_name = OpenClaw Monitor
6. Resource Usage Monitoring
6.1 Monitoring System Resources with Node Exporter
# Install Node Exporter
docker run -d \
  --name node-exporter \
  --restart=always \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
6.2 Key System Metrics
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| CPU Usage | < 50% | > 80% for 5 minutes |
| Memory Usage | < 70% | > 90% for 5 minutes |
| Disk Usage | < 80% | > 90% |
| Network Connections | < 1000 | > 5000 |
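With Node Exporter scraped, the thresholds in the table map onto standard node_* metrics. These PromQL expressions (standard Node Exporter metric names; the mountpoint label may need adjusting for your layout) can back both panels and alert rules:

```promql
# CPU usage (%)
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%) for the root filesystem
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```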
6.3 OpenClaw Log Monitoring
Monitor error logs using the openclaw logs command:
# Monitor error logs in real time
openclaw logs | grep --line-buffered -iE "error|warn|fail"
# Use Loki + Promtail for log collection (optional)
# Forward OpenClaw logs to Loki for centralized management and querying
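If you go the Loki route, a minimal Promtail scrape config might look like the sketch below. The log path is an assumption; point __path__ at wherever your OpenClaw install actually writes its logs:

```yaml
# Sketch: Promtail scrape config for OpenClaw logs (log path is an assumption)
scrape_configs:
  - job_name: openclaw
    static_configs:
      - targets: [localhost]
        labels:
          job: openclaw
          __path__: /var/log/openclaw/*.log
```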
7. Monitoring System Summary
A complete OpenClaw monitoring system should include the following layers:
┌─────────────────────────────────────────────┐
│ Alert Notification Layer │
│ Telegram / Email / Webhook / WeChat Work │
├─────────────────────────────────────────────┤
│ Visualization Layer │
│ Grafana Dashboard │
├─────────────────────────────────────────────┤
│ Data Collection Layer │
│ Prometheus + Node Exporter + Loki │
├─────────────────────────────────────────────┤
│ Service Layer │
│ OpenClaw Gateway (:18789) │
│ Prometheus Metrics (:9191) │
└─────────────────────────────────────────────┘
With the configuration above, you can achieve comprehensive monitoring of your OpenClaw service, ensuring that any anomaly is detected and addressed promptly. We recommend starting with the simplest cron health check and gradually upgrading to a full Prometheus + Grafana solution.