Error Handling Overview
AI services face multiple potential failures: provider API outages, rate limits, network timeouts, channel disconnections, and more. Proper error handling configuration can significantly improve service availability.
Provider Error Handling
Auto-Retry
{
"providers": {
"openai": {
"retry": {
"maxAttempts": 3,
"initialDelay": 1000,
"maxDelay": 10000,
"backoffMultiplier": 2,
"retryableErrors": [429, 500, 502, 503, 504, "ECONNRESET", "ETIMEDOUT"]
}
}
}
}
Retries use exponential backoff: 1s, then 2s, then 4s, with each delay capped at maxDelay.
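The delay schedule implied by the config above can be sketched as a small helper. This is an illustration only; `retry_delay` is a hypothetical function, not an OpenClaw API:

```python
def retry_delay(attempt, initial_delay=1000, max_delay=10000, multiplier=2):
    """Delay in ms before retry number `attempt` (1-based),
    growing geometrically and capped at max_delay."""
    return min(initial_delay * multiplier ** (attempt - 1), max_delay)

# With maxAttempts = 3, the waits are 1s, 2s, 4s:
schedule = [retry_delay(n) for n in (1, 2, 3)]  # [1000, 2000, 4000]
```

Note that the cap matters only for longer chains: with these settings, attempt 5 and beyond would all wait the full 10 seconds.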
Model Failover
{
"models": {
"primary": {
"provider": "openai",
"model": "gpt-4o",
"fallback": "secondary"
},
"secondary": {
"provider": "anthropic",
"model": "claude-sonnet-4-20250514",
"fallback": "local"
},
"local": {
"provider": "ollama",
"model": "llama3.1:8b"
}
}
}
Failover chain: gpt-4o -> Claude -> Llama (local)
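Walking a fallback chain like the one above amounts to following `fallback` links until a model has none. A minimal sketch (the `resolve_chain` helper is illustrative, not part of OpenClaw):

```python
# Mirrors the "models" config above; each entry may name a fallback.
MODELS = {
    "primary":   {"provider": "openai",    "model": "gpt-4o",
                  "fallback": "secondary"},
    "secondary": {"provider": "anthropic", "model": "claude-sonnet-4-20250514",
                  "fallback": "local"},
    "local":     {"provider": "ollama",    "model": "llama3.1:8b"},
}

def resolve_chain(name):
    """Return the ordered list of model entries to try."""
    chain, seen = [], set()
    while name and name not in seen:  # `seen` guards against fallback cycles
        seen.add(name)
        chain.append(name)
        name = MODELS[name].get("fallback")
    return chain
```

A request to `primary` would try `["primary", "secondary", "local"]` in order, stopping at the first model that answers.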
Circuit Breaker Pattern
When a provider fails repeatedly, temporarily stop sending requests:
{
"providers": {
"openai": {
"circuitBreaker": {
"enabled": true,
"failureThreshold": 5,
"resetTimeout": 60000,
"halfOpenRequests": 2
}
}
}
}
- After 5 consecutive failures, the circuit breaker opens
- After 60 seconds, it enters a half-open state
- 2 probe requests are allowed through
- If the probes succeed, the circuit breaker closes; if they fail, it reopens for another reset period
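The state machine described above can be sketched in a few lines. This is a minimal illustration of the closed / open / half-open transitions using the parameters from the config, not OpenClaw's actual implementation:

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures; open -> half-open after
    reset_timeout; half-open admits a limited number of probe requests."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0,
                 half_open_requests=2):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.half_open_requests = half_open_requests
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probes = 0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state, self.probes = "half-open", 0  # begin probing
            else:
                return False
        if self.state == "half-open":
            if self.probes >= self.half_open_requests:
                return False  # probe budget spent, wait for results
            self.probes += 1
        return True

    def record_success(self):
        self.state, self.failures = "closed", 0

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", time.monotonic()
            self.failures = 0
```

A single probe failure in the half-open state reopens the circuit immediately, which is the usual conservative choice.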
Rate Limit Handling
{
"providers": {
"openai": {
"rateLimit": {
"respectHeaders": true,
"queueOverflow": true,
"maxQueueSize": 100,
"queueTimeout": 30000
}
}
}
}
When a 429 error is received, OpenClaw will:
- Read the Retry-After header
- Queue the request
- Resend after the specified wait time
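The behavior can be sketched as a bounded queue that records the server-specified wait. The class and method names are illustrative only; they are not OpenClaw internals:

```python
from collections import deque

class RateLimitQueue:
    """Hold rate-limited requests, bounded by maxQueueSize."""

    def __init__(self, max_queue_size=100):
        self.max_queue_size = max_queue_size
        self.queue = deque()

    def handle_429(self, request, headers):
        """Queue the request; return the wait (seconds) from Retry-After.
        Falls back to a 1-second wait if the header is absent."""
        if len(self.queue) >= self.max_queue_size:
            raise OverflowError("rate-limit queue full")  # exceeds maxQueueSize
        self.queue.append(request)
        return float(headers.get("Retry-After", 1))

q = RateLimitQueue(max_queue_size=2)
wait = q.handle_429({"prompt": "hi"}, {"Retry-After": "20"})  # 20.0
```

Requests arriving once the queue is full fail fast rather than waiting indefinitely, which matches the spirit of the queueTimeout setting.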
Channel Error Handling
Auto-Reconnect
{
"channels": {
"whatsapp-main": {
"reconnect": {
"enabled": true,
"maxAttempts": 10,
"interval": 30000,
"backoff": true,
"notifyOnDisconnect": true,
"notifyChannel": "telegram-admin"
}
}
}
}
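The reconnect schedule above can be sketched as a generator. The doubling behavior when `"backoff": true` is an assumption for illustration; the config does not specify the backoff multiplier:

```python
def reconnect_intervals(max_attempts=10, interval=30000, backoff=True):
    """Yield the wait (ms) before each reconnect attempt.
    With backoff enabled, the interval is assumed to double per attempt."""
    for attempt in range(max_attempts):
        yield interval * (2 ** attempt) if backoff else interval

# First three waits with backoff: [30000, 60000, 120000]
waits = list(reconnect_intervals(max_attempts=3))
```

Without backoff, every attempt simply waits the fixed 30-second interval.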
Webhook Error Handling
{
"channels": {
"telegram": {
"webhook": {
"errorResponse": {
"onProviderError": "retry",
"onTimeout": "apologize",
"onRateLimit": "queue"
},
"errorMessages": {
"providerDown": "Sorry, the AI service is temporarily unavailable. Please try again later.",
"rateLimit": "Too many requests. Please wait a moment.",
"timeout": "Processing timed out. Please resend your message."
}
}
}
}
}
Global Error Handling
{
"errorHandling": {
"global": {
"uncaughtException": "restart",
"unhandledRejection": "log",
"memoryLimit": {
"threshold": "450MB",
"action": "restart"
}
},
"notifications": {
"enabled": true,
"channels": ["telegram-admin"],
"minSeverity": "error",
"cooldown": 300
}
}
}
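The `"cooldown": 300` setting suppresses repeat alerts for the same error within a 5-minute window. A minimal sketch of that deduplication logic (the `Notifier` class is hypothetical):

```python
import time

class Notifier:
    """Send at most one notification per error key per cooldown window."""

    def __init__(self, cooldown=300.0):
        self.cooldown = cooldown
        self.last_sent = {}  # error key -> timestamp of last notification

    def should_notify(self, error_key, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_sent.get(error_key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window
        self.last_sent[error_key] = now
        return True
```

Keying on the error type means a provider outage and a rate-limit storm each get their own alert, while a flood of identical errors produces only one.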
Graceful Degradation
When primary features are unavailable, provide degraded service:
{
"degradation": {
"rules": [
{
"condition": "provider.openai.down",
"actions": [
{"switch": "model", "to": "local"},
{"disable": "tools", "except": ["basic"]},
{"notify": "admin"}
]
},
{
"condition": "memory.high",
"actions": [
{"reduce": "maxHistory", "to": 5},
{"disable": "memory_search"}
]
}
]
}
}
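Evaluating rules like these boils down to collecting the actions of every rule whose condition is currently active. A sketch, treating conditions as simple flags (the flag names come from the config above; the evaluation helper itself is illustrative):

```python
# Mirrors the "degradation" rules above.
RULES = [
    {"condition": "provider.openai.down",
     "actions": [{"switch": "model", "to": "local"},
                 {"disable": "tools", "except": ["basic"]},
                 {"notify": "admin"}]},
    {"condition": "memory.high",
     "actions": [{"reduce": "maxHistory", "to": 5},
                 {"disable": "memory_search"}]},
]

def active_actions(flags):
    """Collect the actions of every rule whose condition flag is set."""
    return [action
            for rule in RULES if rule["condition"] in flags
            for action in rule["actions"]]
```

When both conditions fire at once, the actions simply accumulate, so rule order determines which adjustments are applied first.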
Error Log Analysis
openclaw errors stats --period 24h
Error Statistics (last 24h):
Total errors: 23
By type:
Provider timeout: 12 (52%)
Rate limit: 8 (35%)
Channel disconnect: 2 (9%)
Unknown: 1 (4%)
Recovery:
Auto-retried: 18 (78%)
Fell back: 3 (13%)
Failed: 2 (9%)
Test Error Handling
# Simulate a provider failure
openclaw test failover --provider openai
# Simulate a timeout
openclaw test timeout --model main --duration 30
# Simulate rate limiting
openclaw test rate-limit --provider openai
Summary
Reliable error handling is the backbone of OpenClaw's high availability. By combining auto-retry, failover, circuit breakers, and graceful degradation, you can achieve 99.9%+ service uptime. The key is having a response plan for every possible failure scenario.