Introduction
When your OpenClaw service needs to handle a large number of concurrent requests, or you need to achieve zero-downtime deployments, a single-instance deployment is no longer sufficient. OpenClaw supports multi-instance deployment mode, achieving horizontal scaling and high availability by running multiple Gateway instances behind a load balancer.
This article covers the architecture design, configuration methods, and operational practices for multi-instance deployment.
Multi-Instance Architecture Overview
                   ┌──────────────┐
                   │  Nginx / LB  │
                   └──────┬───────┘
                          │
           ┌──────────────┼──────────────┐
           ▼              ▼              ▼
      ┌──────────┐   ┌──────────┐   ┌──────────┐
      │ OpenClaw │   │ OpenClaw │   │ OpenClaw │
      │ Instance │   │ Instance │   │ Instance │
      │    #1    │   │    #2    │   │    #3    │
      └────┬─────┘   └────┬─────┘   └────┬─────┘
           │              │              │
           └──────────────┼──────────────┘
                          ▼
                  ┌──────────────┐
                  │ Shared Store │
                  │ (Redis/NFS)  │
                  └──────────────┘
The core challenge of multi-instance deployment is sharing session state. OpenClaw provides two solutions: shared filesystem and Redis session storage.
Shared Storage Configuration
Option 1: Shared Filesystem (NFS / Mounted Volumes)
The simplest approach is to mount the session directory on a shared filesystem:
{
  storage: {
    // All instances point to the same shared directory
    dataDir: "/mnt/shared/openclaw-data",
    sessions: {
      dir: "/mnt/shared/openclaw-data/sessions"
    }
  }
}
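OpenClaw coordinates concurrent session writes internally, but the underlying pattern on a shared filesystem is an advisory file lock around each write. The sketch below illustrates that pattern in Python; the function name and file layout are illustrative, not OpenClaw's API. Note that fcntl locks over NFS depend on the NFS lock daemon and can misbehave on misconfigured mounts.

```python
import fcntl
import json
import os

def write_session(path: str, session: dict) -> None:
    """Rewrite a session file under an exclusive advisory lock,
    so two gateway instances never interleave partial writes."""
    with open(path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other writer holds the lock
        try:
            f.seek(0)
            f.truncate()
            json.dump(session, f)
            f.flush()
            os.fsync(f.fileno())       # make the write durable before releasing
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```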
Docker Compose example:
version: "3.8"

services:
  openclaw-1:
    image: openclaw/gateway:latest
    ports:
      - "3001:3000"
    volumes:
      - shared-data:/data/openclaw
    environment:
      - OPENCLAW_DATA_DIR=/data/openclaw
      - OPENCLAW_INSTANCE_ID=node-1

  openclaw-2:
    image: openclaw/gateway:latest
    ports:
      - "3002:3000"
    volumes:
      - shared-data:/data/openclaw
    environment:
      - OPENCLAW_DATA_DIR=/data/openclaw
      - OPENCLAW_INSTANCE_ID=node-2

  openclaw-3:
    image: openclaw/gateway:latest
    ports:
      - "3003:3000"
    volumes:
      - shared-data:/data/openclaw
    environment:
      - OPENCLAW_DATA_DIR=/data/openclaw
      - OPENCLAW_INSTANCE_ID=node-3

  nginx:
    image: nginx:alpine
    ports:
      - "3000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - openclaw-1
      - openclaw-2
      - openclaw-3

volumes:
  shared-data:
    driver: local
Option 2: Redis Session Storage
For higher concurrency scenarios, Redis is recommended as the session storage backend:
{
  storage: {
    backend: "redis",
    redis: {
      url: "redis://redis-host:6379",
      password: "your-redis-password",
      db: 0,
      // Key prefix for session data
      keyPrefix: "openclaw:",
      // Connection pool size
      poolSize: 10
    }
  }
}
The Redis backend has two advantages: atomic server-side operations keep concurrent writes safe without file locks, and read/write latency is lower than a network filesystem's.
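As a sketch of why this is safe (the helper names below are illustrative, not OpenClaw's API): Redis executes each list or hash command atomically on the server, so appends from different gateway instances serialize cleanly without any client-side locking.

```python
def session_key(prefix: str, session_id: str) -> str:
    """Build the Redis key for one session's message list,
    following the keyPrefix from the config above."""
    return f"{prefix}session:{session_id}:messages"

def append_message(client, prefix: str, session_id: str, message: str) -> None:
    # RPUSH is atomic in Redis: two instances appending to the same
    # session never interleave partial writes, unlike concurrent
    # writers on a shared filesystem. `client` is any redis-py-style
    # connection object.
    client.rpush(session_key(prefix, session_id), message)
```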
Nginx Load Balancing Configuration
Basic Round-Robin Strategy
upstream openclaw_backend {
    server openclaw-1:3000;
    server openclaw-2:3000;
    server openclaw-3:3000;
}

server {
    listen 80;
    server_name openclaw.example.com;

    location / {
        proxy_pass http://openclaw_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # SSE streaming responses require special configuration
    location /api/v1/chat/stream {
        proxy_pass http://openclaw_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}
IP Hash Strategy (Session Affinity)
If you're using a shared filesystem and want to reduce file lock contention, configure IP Hash to route requests from the same user to the same instance:
upstream openclaw_backend {
    ip_hash;
    server openclaw-1:3000;
    server openclaw-2:3000;
    server openclaw-3:3000;
}
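Conceptually, ip_hash maps each client address to a fixed backend, so the same user keeps hitting the same instance. The Python sketch below illustrates the idea only: nginx's actual algorithm hashes the first three octets of an IPv4 address, while this version hashes the whole string.

```python
import hashlib

BACKENDS = ["openclaw-1:3000", "openclaw-2:3000", "openclaw-3:3000"]

def pick_backend(client_ip: str, backends=BACKENDS) -> str:
    """Deterministically map a client IP to one backend."""
    # md5 gives a stable hash across processes (Python's built-in
    # hash() is salted per run, so it would break affinity on restart).
    digest = hashlib.md5(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]
```

The trade-off is uneven load when many users share one address (e.g. a corporate NAT), which is why ip_hash is suggested here only to reduce lock contention, not as the default strategy.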
Health Checks
Open-source Nginx supports only passive health checks: an upstream is marked unavailable after max_fails failed requests within fail_timeout. Active probing requires NGINX Plus or an external checker.
upstream openclaw_backend {
    server openclaw-1:3000 max_fails=3 fail_timeout=30s;
    server openclaw-2:3000 max_fails=3 fail_timeout=30s;
    server openclaw-3:3000 max_fails=3 fail_timeout=30s;
}
Channel Instance Binding
For channels like Telegram and Discord that maintain WebSocket long connections, note that each channel's Bot connection can only be held by one instance. OpenClaw solves this through an instance lock mechanism:
{
  cluster: {
    enabled: true,
    // Current instance ID, must be unique per instance
    instanceId: "node-1",
    // Channel assignment strategy
    channelBinding: {
      // Which instance is responsible for which channels
      "node-1": ["telegram", "discord"],
      "node-2": ["slack", "whatsapp"],
      "node-3": ["webchat"]
    }
  }
}
If an instance goes down, its assigned channels will automatically fail over to other instances.
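OpenClaw's instance lock is internal, but the failover behavior can be understood as a lease: the owning instance keeps renewing a short TTL, and once renewals stop, any other instance may claim the channel. Below is a minimal in-memory sketch of that idea (class and method names are hypothetical; a real deployment would back this with Redis `SET NX PX` rather than a local dict):

```python
import time

class LeaseLock:
    """Lease-based channel ownership: the holder renews before the TTL
    expires; a stale lease can be claimed by any other instance."""

    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self._leases = {}  # channel -> (owner_instance_id, expires_at)

    def try_acquire(self, channel: str, instance_id: str, now=None) -> bool:
        """Acquire or renew the lease on `channel`. Returns True if
        `instance_id` now owns it."""
        now = time.monotonic() if now is None else now
        owner, expires = self._leases.get(channel, (None, 0.0))
        if owner in (None, instance_id) or now >= expires:
            self._leases[channel] = (instance_id, now + self.ttl)
            return True
        return False
```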
Zero-Downtime Rolling Updates
By combining the load balancer with health checks, you can achieve zero-downtime rolling updates:
#!/bin/bash
# rolling-update.sh -- update instances one at a time behind the load balancer
set -euo pipefail

INSTANCES=("openclaw-1" "openclaw-2" "openclaw-3")
# Host ports published per instance in docker-compose.yml
declare -A PORTS=([openclaw-1]=3001 [openclaw-2]=3002 [openclaw-3]=3003)

for instance in "${INSTANCES[@]}"; do
    echo "Updating $instance ..."

    # 1. Mark instance as maintenance mode (stop accepting new requests)
    docker exec "$instance" openclaw maintenance on

    # 2. Wait for in-flight requests to drain
    sleep 10

    # 3. Pull the new image and restart this service only
    docker compose pull "$instance"
    docker compose up -d "$instance"

    # 4. Wait for this instance's own health check to pass. Port 3000 on
    #    the host is the load balancer, so probe the instance's port.
    until curl -sf "http://localhost:${PORTS[$instance]}/api/v1/health" > /dev/null; do
        sleep 2
    done

    echo "$instance update complete"
done
Monitoring and Log Aggregation
In a multi-instance environment, centralized logging and monitoring are especially important.
Unified Log Format
{
  logging: {
    format: "json",
    // Include instance ID in logs
    includeInstanceId: true,
    // Output to stdout for log collection
    output: "stdout"
  }
}
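With this configuration each instance emits one JSON object per line to stdout, ready for collectors such as Fluent Bit or Vector. A line might look like the following; the field names are illustrative, not OpenClaw's documented schema:

```json
{"ts": "2025-01-15T08:30:12.345Z", "level": "info", "instanceId": "node-2", "msg": "chat completion finished", "sessionId": "sess_8f2a", "latencyMs": 1840}
```

The instanceId field is what lets you attribute a log line to a specific node after aggregation.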
Prometheus Metrics
Each instance exposes a /api/v1/metrics endpoint that can be scraped by Prometheus:
# prometheus.yml
scrape_configs:
  - job_name: "openclaw"
    # The metrics live under /api/v1/metrics, not Prometheus's
    # default /metrics path
    metrics_path: /api/v1/metrics
    static_configs:
      - targets:
          - "openclaw-1:3000"
          - "openclaw-2:3000"
          - "openclaw-3:3000"
Key monitoring metrics include: requests per second, response latency percentiles, token consumption rate, active session count, and MCP tool invocation frequency.
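These translate into PromQL queries such as the ones below. The metric names follow common Prometheus conventions and are assumptions; verify them against the actual /api/v1/metrics output of your deployment.

```promql
# Requests per second summed across all instances
sum(rate(http_requests_total{job="openclaw"}[5m]))

# p95 response latency, aggregated over instances
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="openclaw"}[5m])))
```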
Capacity Planning Recommendations
| Concurrent Users | Recommended Instances | Redis Configuration | CPU/Memory |
|---|---|---|---|
| < 50 | 1 (single instance) | Not needed | 1C/1G |
| 50-200 | 2-3 | Single-node Redis | 2C/2G per instance |
| 200-1000 | 3-5 | Redis Sentinel | 4C/4G per instance |
| > 1000 | 5+ | Redis Cluster | 8C/8G per instance |
It's worth noting that OpenClaw's primary bottleneck is usually not the Gateway itself, but rather the downstream AI model API's concurrency limits and response speeds.
Conclusion
OpenClaw's multi-instance deployment capability lets it scale smoothly from a personal assistant to an enterprise-grade service: a shared storage layer keeps state consistent, a load balancer distributes traffic, channel binding with instance locks manages long-lived connections, and rolling updates plus centralized monitoring complete a production-grade, scalable setup for your AI Agent gateway.