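OpenClaw Resource Monitoring and Alert Configuration

Introduction

Understanding OpenClaw's resource consumption is key to keeping the service stable. This document covers how to monitor OpenClaw's resource usage comprehensively, including memory, CPU, network connections, and message throughput, and how to set up automated alert notifications for when a metric becomes abnormal.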

1. Built-in Resource Monitoring

1.1 The openclaw stats command

OpenClaw provides a built-in statistics command for a quick view of resource usage:

# Show current resource usage
openclaw stats

# Sample output
# ┌─────────────────────────────────────────┐
# │         OpenClaw Runtime Statistics     │
# ├─────────────────────────────────────────┤
# │ Uptime:       3d 12h 45m               │
# │ Process PID:  12345                     │
# │ Memory (Heap): 168MB / 512MB (32%)     │
# │ Memory (RSS):  245MB                    │
# │ CPU (1m avg):  2.3%                     │
# │ Active Channels: 3 / 3                  │
# │ Today's Messages: 342                   │
# │ Today's Tokens:   125,800               │
# │ Today's Cost:     $1.85                 │
# │ Avg Response:     1.8s                  │
# │ Error Rate:       0.3%                  │
# └─────────────────────────────────────────┘

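1.2 Live monitoring dashboard

# Start live monitoring (similar to the top command)
openclaw stats --live

# Custom refresh interval (seconds)
openclaw stats --live --interval 5

The live dashboard continuously updates the following metrics:

Memory usage trend graph (ASCII chart)
Messages processed per minute
Current number of active connections
API call latency
Error count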

1.3 Querying historical statistics

# Show message statistics for the last 24 hours
openclaw stats --period 24h

# Show the resource trend for the last 7 days
openclaw stats --period 7d --metric memory

# Export statistics as CSV
openclaw stats --period 30d --format csv > openclaw-stats.csv

2. HTTP API Monitoring Endpoints

2.1 Querying runtime metrics

# Basic runtime metrics
curl -s http://localhost:18789/health/stats | jq .

Response data:

{
  "uptime": 302400,
  "memory": {
    "heapUsed": 168000000,
    "heapTotal": 536870912,
    "rss": 257000000,
    "external": 15000000
  },
  "cpu": {
    "user": 125000,
    "system": 45000,
    "percent": 2.3
  },
  "messages": {
    "today": 342,
    "thisHour": 28,
    "total": 15680
  },
  "tokens": {
    "today": {
      "input": 89500,
      "output": 36300
    }
  },
  "responseTime": {
    "avg": 1800,
    "p50": 1500,
    "p95": 3200,
    "p99": 5100
  },
  "errors": {
    "today": 3,
    "rate": 0.003
  }
}
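
The raw payload is easier to read once a few figures are derived from it. The following is a minimal Python sketch, not part of OpenClaw itself. It assumes the endpoint URL from the curl example, that the memory fields are bytes, that responseTime values are milliseconds, and that errors.rate is a fraction. The names fetch_stats and summarize are hypothetical helpers introduced for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: same URL as the curl example above.
STATS_URL = "http://localhost:18789/health/stats"


def fetch_stats(url: str = STATS_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the JSON payload from the stats endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(stats: dict) -> dict:
    """Derive a few human-friendly figures from the raw payload."""
    mem = stats["memory"]
    tokens = stats["tokens"]["today"]
    return {
        "uptime_days": round(stats["uptime"] / 86400, 2),
        # Assumed units: bytes for memory, milliseconds for response times.
        "heap_percent": round(100 * mem["heapUsed"] / mem["heapTotal"], 1),
        "rss_mib": round(mem["rss"] / 2**20, 1),
        "tokens_today": tokens["input"] + tokens["output"],
        "p95_seconds": stats["responseTime"]["p95"] / 1000,
        "error_percent": round(100 * stats["errors"]["rate"], 2),
    }


# Trimmed copy of the sample response shown above, so the sketch runs offline.
sample = {
    "uptime": 302400,
    "memory": {"heapUsed": 168000000, "heapTotal": 536870912, "rss": 257000000},
    "tokens": {"today": {"input": 89500, "output": 36300}},
    "responseTime": {"avg": 1800, "p50": 1500, "p95": 3200, "p99": 5100},
    "errors": {"today": 3, "rate": 0.003},
}

print(summarize(sample))
```

Against a running instance, summarize(fetch_stats()) would produce the same structure from live data.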

2.2 Per-channel statistics

# Get message statistics per channel
curl -s http://localhost:18789/health/channels | jq .

{
  "channels": [
    {
      "name": "whatsapp",
      "status": "connected",
      "uptime": 302400,
      "messagesReceived": 180,
      "messagesSent": 175,
      "avgResponseTime": 1650,
      "errors": 1
    },
    {
      "name": "telegram",
      "status": "connected",
      "uptime": 302400,
      "messagesReceived": 120,
      "messagesSent": 118,
      "avgResponseTime": 1950,
      "errors": 2
    }
  ]
}
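
A per-channel payload like this lends itself to a quick health table. The sketch below is one possible way to post-process it in Python. The helper name channel_report and the derived fields (reply_gap, error_percent) are illustrative assumptions, and it assumes that errors can be meaningfully compared with messagesReceived.

```python
def channel_report(payload: dict) -> list[dict]:
    """Build one summary row per channel from the /health/channels payload."""
    rows = []
    for ch in payload["channels"]:
        received = ch["messagesReceived"]
        rows.append({
            "name": ch["name"],
            "connected": ch["status"] == "connected",
            # Received messages that have no matching sent message (yet).
            "reply_gap": received - ch["messagesSent"],
            # Assumption: errors are counted against received messages.
            "error_percent": round(100 * ch["errors"] / received, 2) if received else 0.0,
        })
    return rows


# Trimmed copy of the sample response shown above.
sample = {
    "channels": [
        {"name": "whatsapp", "status": "connected",
         "messagesReceived": 180, "messagesSent": 175, "errors": 1},
        {"name": "telegram", "status": "connected",
         "messagesReceived": 120, "messagesSent": 118, "errors": 2},
    ]
}

for row in channel_report(sample):
    print(row)
```

A channel whose connected field is False, or whose reply_gap keeps growing between polls, is a natural candidate for closer inspection.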

3. Prometheus Metrics Collection

3.1 Enabling the Prometheus endpoint

// ~/.config/openclaw/openclaw.json5
{
  "monitoring": {
    "prometheus": {
      "enabled": true,
      "port": 9191,
      "path": "/metrics"
    }
  }
}

3.2 Key Prometheus metrics

Metrics exported by OpenClaw, grouped by category:

Message processing metrics:

Metric name	Type	Description
`openclaw_messages_received_total`	Counter	Total messages received (labeled by channel)
`openclaw_messages_sent_total`	Counter	Total messages sent
`openclaw_messages_failed_total`	Counter	Number of failed messages

Model call metrics:

Metric name	Type	Description
`openclaw_model_requests_total`	Counter	Total model API calls
`openclaw_model_errors_total`	Counter	Number of model API errors
`openclaw_model_duration_seconds`	Histogram	Distribution of model response times
`openclaw_model_tokens_total`	Counter	Total token usage (labeled by input/output)

Resource metrics:

Metric name	Type	Description
`openclaw_memory_heap_bytes`	Gauge	Heap memory in use
`openclaw_memory_rss_bytes`	Gauge	Resident set size
`openclaw_active_connections`	Gauge	Number of active connections
`openclaw_queue_length`	Gauge	Request queue length
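
When Prometheus is not yet deployed, the /metrics output can still be inspected directly, because it uses the plain-text Prometheus exposition format. The following Python sketch parses the common "name{labels} value" lines. It is a simplified reader that ignores timestamps and unusual escaping, and the sample text is hypothetical output assembled from the metric names in the tables above, not captured from a real instance.

```python
import re

# name, optional {labels}, then the value; a trailing timestamp is ignored.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Return (name, labels, value) for every sample line in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output, for illustration only.
SAMPLE_TEXT = """\
# HELP openclaw_messages_received_total Total messages received
# TYPE openclaw_messages_received_total counter
openclaw_messages_received_total{channel="whatsapp"} 180
openclaw_messages_received_total{channel="telegram"} 120
openclaw_memory_rss_bytes 257000000
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

For anything beyond a quick look, a maintained parser or Prometheus itself is the better choice.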

3.3 Useful PromQL queries

# Messages processed per minute
rate(openclaw_messages_received_total[5m]) * 60

# Message volume per channel
sum by (channel) (increase(openclaw_messages_received_total[24h]))

# Model call P95 latency
histogram_quantile(0.95, rate(openclaw_model_duration_seconds_bucket[5m]))

# Error rate
rate(openclaw_model_errors_total[5m]) / rate(openclaw_model_requests_total[5m])

# Heap usage percentage (assumes openclaw_memory_heap_max_bytes is exported; it is not listed in the tables in 3.2)
openclaw_memory_heap_bytes / openclaw_memory_heap_max_bytes * 100

# Token consumption rate (per hour)
rate(openclaw_model_tokens_total[1h]) * 3600
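
Several of these queries multiply rate() by 60 or 3600. That works because rate() yields a per-second average over the window, adjusted for counter resets. The Python sketch below illustrates the idea in simplified form. It is not Prometheus's actual algorithm, which also extrapolates to the window boundaries, and the sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there is not enough data.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter dropped, so it was reset (for example by a restart);
            # everything counted since the reset is new.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Made-up scrape history: one sample per minute, with a restart before t=180.
history = [(0, 100), (60, 130), (120, 160), (180, 10), (240, 40)]

per_second = simple_rate(history)
print("messages per minute:", per_second * 60)
```

Here the counter grows by 100 in total over 240 seconds, which works out to roughly 25 messages per minute even though the raw value fell after the restart.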

4. Configuring Alert Rules

4.1 Threshold-based alerts

Define alert rules in the OpenClaw configuration:

{
  "alerts": {
    "enabled": true,
    "rules": [
      {
        "name": "High Memory Usage",
        "condition": "memory.heapPercent > 85",
        "duration": "10m",
        "severity": "warning",
        "message": "Memory usage is above 85%. Current: {value}%"
      },
      {
        "name": "Slow Response",
        "condition": "responseTime.p95 > 5000",
        "duration": "5m",
        "severity": "warning",
        "message": "P95 response time is above 5 seconds. Current: {value}ms"
      },
      {
        "name": "High Error Rate",
        "condition": "errors.rate > 0.05",
        "duration": "5m",
        "severity": "critical",
        "message": "Error rate is above 5%. Current: {value}"
      },
      {
        "name": "Channel Disconnected",
        "condition": "channels.disconnected > 0",
        "duration": "3m",
        "severity": "critical",
        "message": "{value} channel(s) disconnected"
      },
      {
        "name": "Queue Backlog",
        "condition": "queue.length > 30",
        "duration": "2m",
        "severity": "warning",
        "message": "Request queue backlog: {value} items"
      }
    ]
  }
}
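
Each rule pairs a condition with a duration, which suggests the alert fires only after the condition has held continuously for that long. The Python sketch below shows one way such semantics could be evaluated. It is an illustration under that assumption, not OpenClaw's actual evaluator; RuleState, lookup, and parse_duration are hypothetical names, and the metrics dictionary shape is inferred from the dotted paths in the rules.

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
_COND_RE = re.compile(r"^\s*([\w.]+)\s*(>=|<=|>|<)\s*([-+]?\d+(?:\.\d+)?)\s*$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '10m' or '30s' to seconds."""
    return int(text[:-1]) * _UNIT_SECONDS[text[-1]]


def lookup(metrics: dict, path: str):
    """Resolve a dotted path such as 'memory.heapPercent' in nested dicts."""
    value = metrics
    for key in path.split("."):
        value = value[key]
    return value


class RuleState:
    """Tracks how long one rule's condition has been continuously true."""

    def __init__(self, rule: dict):
        match = _COND_RE.match(rule["condition"])
        if match is None:
            raise ValueError(f"unsupported condition: {rule['condition']!r}")
        self.path, op, threshold = match.groups()
        self.compare = _OPS[op]
        self.threshold = float(threshold)
        self.hold_seconds = parse_duration(rule["duration"])
        self.breached_since = None

    def update(self, metrics: dict, now: float) -> bool:
        """Return True once the condition has held for the full duration."""
        if not self.compare(lookup(metrics, self.path), self.threshold):
            self.breached_since = None  # condition cleared, start over
            return False
        if self.breached_since is None:
            self.breached_since = now
        return now - self.breached_since >= self.hold_seconds


rule = {"name": "High Memory Usage", "condition": "memory.heapPercent > 85", "duration": "10m"}
state = RuleState(rule)
for now in (0, 300, 600):  # three evaluations, five minutes apart
    print(now, state.update({"memory": {"heapPercent": 90}}, now))
```

Resetting breached_since whenever the condition clears is what keeps short spikes from triggering an alert.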

4.2 Alert notification channels

{
  "alerts": {
    "notifications": [
      {
        "type": "telegram",
        "botToken": "YOUR_BOT_TOKEN",
        "chatId": "YOUR_CHAT_ID",
        // Receive critical-level alerts only
        "minSeverity": "critical"
      },
      {
        "type": "webhook",
        "url": "https://hooks.slack.com/services/xxx",
        "minSeverity": "warning"
      },
      {
        "type": "email",
        "to": "YOUR_EMAIL_ADDRESS",
        "minSeverity": "critical"
      }
    ],
    // Alert throttling: minimum interval between identical alerts
    "throttle": "15m",
    // Send a notification when an alert is resolved
    "notifyOnResolve": true
  }
}
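
The interaction between minSeverity and throttle can be sketched in a few lines of Python. This is an illustrative model of the behavior the comments describe, not OpenClaw's implementation. The Notifier class is hypothetical, and the ordering "warning below critical" is an assumption based on the two severities used in this document.

```python
# Assumption: only the two severities used in this document, in this order.
SEVERITY_ORDER = {"warning": 1, "critical": 2}


class Notifier:
    """Chooses which channels receive an alert, with per-alert throttling."""

    def __init__(self, channels: list[dict], throttle_seconds: int):
        self.channels = channels
        self.throttle_seconds = throttle_seconds
        self.last_sent = {}  # alert name -> time of the last delivery

    def targets(self, alert_name: str, severity: str, now: float) -> list[str]:
        """Return the channel types that should be notified right now."""
        last = self.last_sent.get(alert_name)
        if last is not None and now - last < self.throttle_seconds:
            return []  # the same alert was delivered too recently
        level = SEVERITY_ORDER[severity]
        chosen = [
            ch["type"]
            for ch in self.channels
            if level >= SEVERITY_ORDER[ch["minSeverity"]]
        ]
        if chosen:
            self.last_sent[alert_name] = now
        return chosen


channels = [
    {"type": "telegram", "minSeverity": "critical"},
    {"type": "webhook", "minSeverity": "warning"},
    {"type": "email", "minSeverity": "critical"},
]
notifier = Notifier(channels, throttle_seconds=15 * 60)

print(notifier.targets("High Memory Usage", "warning", now=0))
print(notifier.targets("High Memory Usage", "warning", now=600))
print(notifier.targets("High Error Rate", "critical", now=600))
```

With the configuration above, a warning reaches only the webhook, a repeat of the same warning ten minutes later is suppressed, and a critical alert reaches all three channels.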

4.3 Grafana alert rules

If you use Grafana, you can configure more flexible alerts:

# Grafana alert rules (Prometheus-style rule file)
groups:
  - name: openclaw
    rules:
      - alert: OpenClawHighMemory
        expr: openclaw_memory_heap_bytes > 400 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OpenClaw memory usage is too high"
          description: "Heap memory: {{ $value | humanize1024 }}B"

      - alert: OpenClawMessageBacklog
        expr: openclaw_queue_length > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OpenClaw message queue backlog"

      - alert: OpenClawDown
        expr: up{job="openclaw"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw service unavailable"

5. Message Volume and Cost Statistics

5.1 Daily report

# Show today's summary statistics
openclaw stats --period today --summary

# Output
# Today's Statistics (2026-03-14)
# ──────────────────────
# Total Messages:    342
# Token Usage:       125,800 (input: 89,500 / output: 36,300)
# Estimated Cost:    $1.85
# Avg Response:      1.8s
# Slowest Response:  6.2s
# Errors:            3 (0.9%)
# Active Users:      28

5.2 Cost forecasting

# Show this month's cost trend and forecast
openclaw stats --cost --period month

# Output
# Monthly Cost Statistics
# ──────────────────────
# Spent:            $42.50
# Daily Average:    $3.04
# Month-End Forecast: $94.20
# Highest Cost Channel: WhatsApp ($22.30)
# Highest Cost User:    user_abc ($8.50)
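
A month-end forecast of this kind can be approximated with a straight-line projection: divide the amount spent by the days elapsed, then multiply by the number of days in the month. The Python sketch below does exactly that. How OpenClaw actually computes its forecast is not stated here, so this is an assumption, and forecast_month_end is a hypothetical helper. With the figures above and the report date from section 5.1, it reproduces the daily average but lands slightly below the sample forecast.

```python
import calendar
from datetime import date


def forecast_month_end(spent: float, today: date) -> tuple[float, float]:
    """Return (daily_average, projected_month_total), rounded to cents."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = spent / today.day  # assumes spending so far covers days 1..today
    return round(daily_average, 2), round(daily_average * days_in_month, 2)


daily, projected = forecast_month_end(42.50, date(2026, 3, 14))
print(f"Daily average: ${daily:.2f}")        # Daily average: $3.04
print(f"Linear projection: ${projected:.2f}")  # Linear projection: $94.11
```

A linear projection ignores weekday patterns and usage growth, so it works best as a rough budget check.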

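6. Recommended Monitoring Architecture

Choose a monitoring approach that fits the size of your deployment:

Individual / small team (1-5 people):

Manual checks with openclaw stats
cron + watchdog script for basic health checks
Telegram alert notifications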
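
For the cron + watchdog option, a watchdog can be as small as a script that probes the stats endpoint and exits non-zero on failure. The Python sketch below is one possible shape. The URL comes from the curl example in section 2.1, the 5% error-rate limit mirrors the "High Error Rate" rule in section 4.1, and probe and main are hypothetical names. The opener parameter exists only so the probe can be exercised without a running server.

```python
import json
from urllib.request import urlopen

# Assumption: same endpoint as the curl example in section 2.1.
STATS_URL = "http://localhost:18789/health/stats"
MAX_ERROR_RATE = 0.05  # mirrors the "High Error Rate" rule in section 4.1


def probe(url: str = STATS_URL, timeout: float = 5.0, opener=urlopen):
    """Return (healthy, detail). Any network or decoding failure is unhealthy."""
    try:
        with opener(url, timeout=timeout) as resp:
            stats = json.load(resp)
    except (OSError, ValueError) as exc:  # URLError and JSON errors included
        return False, f"probe failed: {exc}"
    rate = stats.get("errors", {}).get("rate", 0.0)
    if rate > MAX_ERROR_RATE:
        return False, f"error rate {rate:.1%} exceeds {MAX_ERROR_RATE:.0%}"
    return True, "ok"


def main() -> int:
    """Print the result and return a shell-style exit code."""
    healthy, detail = probe()
    print(detail)
    return 0 if healthy else 1


# A script run from cron would end with:  raise SystemExit(main())
```

Because the exit code is non-zero on failure, the cron job can chain a restart or notification command after the script.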

Medium scale (5-50 people):

Enable Prometheus metrics collection
Deploy a Grafana dashboard
Configure multi-level alert rules
Review cost reports regularly

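Large scale / enterprise:

Full Prometheus + Grafana + Alertmanager stack
Integration with an enterprise monitoring platform (Datadog/New Relic)
Centralized log collection (ELK/Loki)
SLA monitoring and automated operations

Whatever the scale, avoid over-engineering, but make sure no key metric is left in a blind spot. Continuous monitoring is the foundation for running OpenClaw reliably.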