OpenClaw Health Checks and Auto-Restart Mechanisms


Introduction

For an OpenClaw service running 24/7, relying on manual monitoring is impractical. You need an automated health check and recovery system that continuously monitors service status, automatically repairs faults when they occur, and promptly sends alerts when repairs are not possible. This article provides a comprehensive overview of OpenClaw's health check tools and auto-restart mechanisms.

1. The openclaw doctor Diagnostic Command

1.1 Basic Usage

openclaw doctor is OpenClaw's built-in all-in-one diagnostic tool that automatically checks various critical service metrics:

openclaw doctor

Example output:

OpenClaw Doctor - Health Diagnostic Report
================================

[Service Status]
  ✓ OpenClaw daemon running (PID: 12345)
  ✓ Uptime: 7d 3h 22m
  ✓ Version: 1.2.0 (latest)

[Gateway Check]
  ✓ Gateway port 18789 accessible
  ✓ Health check endpoint responding normally (12ms)
  ✓ Web Dashboard accessible

[Channel Connections]
  ✓ WhatsApp: Connected
  ✓ Telegram: Connected
  ⚠ Discord: Reconnecting (last disconnected: 2 minutes ago)

[Model Services]
  ✓ Claude API: Available (latency: 320ms)
  ✓ API Key: Valid
  ⚠ Monthly token usage: 85% (approaching quota)

[Resource Usage]
  ✓ Memory: 198MB / 512MB (38%)
  ✓ CPU: Average 3.2%
  ✓ Disk: Log directory 256MB
  ✓ File descriptors: 128 / 65536

[Recent Errors]
  ⚠ 2 API timeouts in the past hour
  ✓ No critical errors in the past 24 hours

Diagnosis result: Generally healthy (2 warnings)

1.2 Verbose Check Mode

# Run all checks with detailed output
openclaw doctor --verbose

# Check specific modules only
openclaw doctor --check gateway
openclaw doctor --check channels
openclaw doctor --check model
openclaw doctor --check resources
openclaw doctor --check skills

# Output in JSON format (convenient for scripting)
openclaw doctor --format json

1.3 Using in Scripts

openclaw doctor's exit codes can be used in scripts:

Exit Code   Meaning
0           Everything is normal
1           Warnings exist but the service is operational
2           Serious issues exist; the service may be affected
3           Service is unavailable

#!/bin/bash
openclaw doctor --quiet
EXIT_CODE=$?

case $EXIT_CODE in
    0) echo "Service is healthy" ;;
    1) echo "Warnings exist, attention needed" ;;
    2) echo "Serious issues, action required" ;;
    3) echo "Service unavailable, urgent fix needed!" ;;
esac

2. Health Check Endpoints

2.1 HTTP Health Checks

OpenClaw Gateway provides multiple health check endpoints:

# Basic liveness check
curl -s http://localhost:18789/health
# {"status":"ok","uptime":604800}

# Readiness check
curl -s http://localhost:18789/health/ready
# {"status":"ready","channels":{"whatsapp":"connected","telegram":"connected"}}

# Detailed status
curl -s http://localhost:18789/health/detail
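
These endpoints are handy as a gate in deploy or startup scripts: poll until the service reports healthy before routing traffic to it. A minimal sketch of such a retry gate (the command, attempt count, and delay are illustrative, not fixed values):

```shell
# wait_until_healthy CMD MAX_ATTEMPTS DELAY_SECONDS
# Retries CMD until it succeeds; returns non-zero if attempts run out.
wait_until_healthy() {
    local cmd="$1" max="$2" delay="$3" attempt=1
    while [ "$attempt" -le "$max" ]; do
        if eval "$cmd"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example: block a deploy step until the readiness endpoint answers,
# checking every 2 seconds for up to a minute
# wait_until_healthy 'curl -sf http://localhost:18789/health/ready >/dev/null' 30 2
```

Using `curl -sf` here means any non-2xx response also counts as a failed attempt, matching the behavior of the endpoints above.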

2.2 Custom Health Check Criteria

You can define what conditions constitute "healthy" in the configuration:

// ~/.config/openclaw/openclaw.json5
{
  "health": {
    // At least one channel must be connected to be considered ready
    "minConnectedChannels": 1,
    // Mark as unhealthy when memory usage exceeds this ratio
    "maxMemoryPercent": 90,
    // Mark as unhealthy after this many consecutive model API failures
    "maxConsecutiveModelErrors": 5,
    // Health check timeout
    "timeout": 5000
  }
}
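
These thresholds compose into a simple readiness predicate. A sketch of the equivalent logic, useful for reproducing the same criteria in your own scripts (the function and metric arguments are illustrative; the threshold values mirror the config above):

```shell
# is_ready CONNECTED_CHANNELS MEM_PERCENT CONSECUTIVE_MODEL_ERRORS
# Applies the same thresholds as the config above: at least 1 channel
# connected, memory at or below 90%, fewer than 5 consecutive model errors.
is_ready() {
    local connected="$1" mem="$2" errors="$3"
    [ "$connected" -ge 1 ] && [ "$mem" -le 90 ] && [ "$errors" -lt 5 ]
}

is_ready 2 38 0 && echo "ready"
is_ready 0 38 0 || echo "not ready: no channels connected"
```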

3. Watchdog Mechanism

3.1 OpenClaw Built-in Watchdog

OpenClaw includes a built-in watchdog that can monitor its own status and automatically recover from anomalies:

{
  "watchdog": {
    "enabled": true,
    // Check interval
    "interval": "60s",
    // Auto-restart conditions
    "restartOn": {
      // Auto-restart on memory exceeded
      "memoryExceeded": true,
      "memoryThreshold": "450MB",
      // Auto-restart when all channels disconnect
      "allChannelsDisconnected": true,
      // Auto-restart after consecutive model failures
      "consecutiveModelErrors": 10,
      // Auto-restart when timeout rate is too high
      "timeoutRateThreshold": 0.5
    },
    // Restart cooldown (prevents frequent restarts)
    "restartCooldown": "5m",
    // Maximum consecutive restarts
    "maxRestarts": 3,
    // Behavior when max restarts exceeded
    "onMaxRestartsExceeded": "alert"  // "alert" or "stop"
  }
}
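
The cooldown and restart-cap settings combine into a single "may I restart now?" decision. A sketch of that logic as a hypothetical helper (timestamps in epoch seconds; the values mirror `restartCooldown: "5m"` and `maxRestarts: 3` above):

```shell
# can_restart NOW LAST_RESTART COOLDOWN_SECONDS RESTART_COUNT MAX_RESTARTS
# Allows a restart only after the cooldown has elapsed and while the
# consecutive-restart cap has not been reached.
can_restart() {
    local now="$1" last="$2" cooldown="$3" count="$4" max="$5"
    [ $((now - last)) -ge "$cooldown" ] && [ "$count" -lt "$max" ]
}

# 400s since last restart, 1 of 3 restarts used -> allowed
can_restart 1000 600 300 1 3 && echo "restart allowed"
# Only 100s since last restart -> denied
can_restart 1000 900 300 1 3 || echo "still cooling down"
```

The cap is what breaks the "restart loop" described later in the best practices: once it is hit, the watchdog escalates (alert or stop) instead of restarting again.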

3.2 External Watchdog Script

For more reliable monitoring, it is recommended to run a watchdog independently of the OpenClaw process, so that it keeps working even when OpenClaw itself is hung:

#!/bin/bash
# /usr/local/bin/openclaw-watchdog.sh

HEALTH_URL="http://localhost:18789/health"
LOG="/var/log/openclaw-watchdog.log"
STATE_FILE="/var/run/openclaw-watchdog.failures"
MAX_FAILURES=3

# Each cron run is a fresh process, so the consecutive-failure count
# must be persisted to disk between runs.
FAILURE_COUNT=$(cat "$STATE_FILE" 2>/dev/null || echo 0)

check_health() {
    RESPONSE=$(curl -sf --max-time 10 "$HEALTH_URL" 2>/dev/null) || return 1
    STATUS=$(echo "$RESPONSE" | jq -r '.status' 2>/dev/null)
    [ "$STATUS" = "ok" ]
}

log_msg() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG"
}

if ! check_health; then
    FAILURE_COUNT=$((FAILURE_COUNT + 1))
    log_msg "Health check failed (consecutive failure #$FAILURE_COUNT)"

    if [ "$FAILURE_COUNT" -ge "$MAX_FAILURES" ]; then
        log_msg "$MAX_FAILURES consecutive failures, restarting service..."
        openclaw restart
        sleep 20

        if check_health; then
            log_msg "Restart successful, service recovered"
            FAILURE_COUNT=0
        else
            log_msg "Still unavailable after restart, sending alert"
            # Send alert (Telegram/Email/Webhook)
            curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
                -d chat_id="${CHAT_ID}" \
                -d text="OpenClaw service restart failed, manual intervention needed! Server: $(hostname)"
        fi
    fi
else
    FAILURE_COUNT=0
fi

echo "$FAILURE_COUNT" > "$STATE_FILE"

Configure scheduled execution:

chmod +x /usr/local/bin/openclaw-watchdog.sh

# Run every 2 minutes
crontab -e
# */2 * * * * /usr/local/bin/openclaw-watchdog.sh

3.3 Systemd Watchdog Integration

If using Systemd to manage OpenClaw, you can leverage Systemd's watchdog feature:

# /etc/systemd/system/openclaw.service
[Service]
# Type=notify is required for the sd_notify-based watchdog heartbeats below
Type=notify
# Watchdog timeout: 120 seconds
# (systemd does not allow inline comments after a value)
WatchdogSec=120
Restart=on-failure
RestartSec=10
StartLimitBurst=5
StartLimitIntervalSec=300

# Notification hooks, run on service start and stop
# (including stops caused by watchdog timeouts or crashes)
ExecStartPost=/usr/local/bin/openclaw-health-notify.sh start
ExecStopPost=/usr/local/bin/openclaw-health-notify.sh stop

OpenClaw needs to periodically send heartbeat signals to Systemd. Enable this in the configuration:

{
  "watchdog": {
    "systemdNotify": true  // Automatically send watchdog heartbeats to systemd
  }
}

4. Channel Auto-Reconnection

4.1 Built-in Reconnection Mechanism

OpenClaw has built-in auto-reconnection logic for each channel connection:

{
  "channels": {
    "reconnect": {
      "enabled": true,
      // Initial reconnection delay
      "initialDelay": "5s",
      // Maximum reconnection delay (exponential backoff ceiling)
      "maxDelay": "5m",
      // Maximum reconnection attempts (0 = unlimited)
      "maxAttempts": 0,
      // Backoff factor
      "backoffFactor": 2
    }
  }
}
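
With these settings, the delay for attempt n follows standard exponential backoff: initialDelay * backoffFactor^(n-1), capped at maxDelay. A sketch of that calculation in seconds (the function is illustrative, not part of the OpenClaw CLI):

```shell
# next_delay ATTEMPT INITIAL_SECONDS FACTOR MAX_SECONDS
# Computes the exponential-backoff delay for the given attempt number,
# capped at the configured maximum.
next_delay() {
    local attempt="$1" delay="$2" factor="$3" max="$4" i=1
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * factor))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$max" ]; then delay="$max"; fi
    echo "$delay"
}

# With the config above (5s initial, factor 2, 5m = 300s ceiling):
next_delay 1 5 2 300   # 5
next_delay 2 5 2 300   # 10
next_delay 7 5 2 300   # 300 (320 capped at maxDelay)
```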

Reconnection log example:

[INFO] [channel:discord] Connection lost, attempting reconnection in 5s (1/∞)
[INFO] [channel:discord] Reconnecting...
[WARN] [channel:discord] Reconnection failed, retrying in 10s (2/∞)
[INFO] [channel:discord] Reconnecting...
[INFO] [channel:discord] Reconnected successfully, downtime: 18s

4.2 Monitoring Channel Connection Status

# View all channel connection statuses
openclaw status

# Output
# Channel Status:
#   WhatsApp  ✓ Connected (uptime: 7d)
#   Telegram  ✓ Connected (uptime: 7d)
#   Discord   ⚠ Reconnecting (disconnected: 2m)

# Manually reconnect a specific channel
openclaw channel reconnect discord

5. Alert Notification Configuration

5.1 Built-in Alert Channels

{
  "alerts": {
    "enabled": true,
    // Alert methods
    "channels": [
      {
        "type": "telegram",
        "botToken": "YOUR_BOT_TOKEN",
        "chatId": "YOUR_CHAT_ID"
      },
      {
        "type": "email",
        "smtp": {
          "host": "smtp.example.com",
          "port": 587,
          "user": "[email protected]",
          "pass": "password"
        },
        "to": "[email protected]"
      },
      {
        "type": "webhook",
        "url": "https://hooks.slack.com/services/xxx"
      }
    ],
    // Alert trigger conditions
    "rules": {
      "serviceDown": true,
      "channelDisconnected": true,
      "highMemory": true,
      "highErrorRate": true,
      "certificateExpiring": true
    }
  }
}

5.2 Alert Throttling

Prevent sending a flood of duplicate alerts in a short period:

{
  "alerts": {
    // Minimum interval between same-type alerts
    "throttle": "15m",
    // Recovery notification
    "notifyOnRecover": true
  }
}
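
The same throttling idea is easy to replicate in your own alert scripts by keeping a timestamp per alert type. A sketch (the function and state directory are illustrative):

```shell
# should_alert TYPE THROTTLE_SECONDS STATE_DIR
# Returns 0 (and records the send time) only if the last alert of this
# type is older than the throttle window; otherwise returns 1.
should_alert() {
    local type="$1" throttle="$2" dir="$3"
    local stamp="$dir/last_${type}" now last
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null || echo 0)
    if [ $((now - last)) -ge "$throttle" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}

# Usage: 15-minute (900s) throttle per alert type
# should_alert serviceDown 900 /var/run/openclaw-alerts && send_the_alert
```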

6. Health Check Best Practices

  1. Multi-layer Checks: Built-in watchdog + external cron script + Uptime Kuma for multi-layer protection
  2. Reasonable Intervals: Health checks should not be too frequent; once every 1-2 minutes is sufficient
  3. Progressive Recovery: First try reconnecting channels, then restart the service, and only as a last resort alert a human for manual intervention
  4. Cooldown Period: Set restart cooldowns to avoid a "detect failure → restart → fail again → restart again" death loop
  5. Regular Drills: Proactively simulate failures (e.g., kill -9) to verify that auto-recovery mechanisms work
  6. Preserve Logs: Watchdog script operation logs should be saved independently for post-incident analysis

A reliable health check and auto-recovery system can help your OpenClaw service achieve close to 99.9% availability, significantly reducing manual operations overhead.
