Introduction
When you message your OpenClaw assistant via WhatsApp or Telegram and have to wait a long time for a reply, the experience suffers significantly. Response latency can originate at several stages: network transmission, model inference, context processing, and more. This article will help you systematically identify bottlenecks and optimize each one.
1. Response Latency Analysis
A message goes through the following stages from send to reply:
```
User sends message
 │
 ├── ① Channel reception delay (typically < 1s)
 │
 ├── ② OpenClaw processing delay (typically < 0.5s)
 │      ├── Context assembly
 │      ├── Skill matching
 │      └── Prompt construction
 │
 ├── ③ Model API delay (typically 1-30s) ★ Main bottleneck
 │      ├── Network transmission
 │      ├── Queue waiting
 │      └── Model inference
 │
 └── ④ Reply delivery delay (typically < 1s)
```
1.1 Measuring Latency at Each Stage
```bash
# Check latency information in OpenClaw logs
openclaw logs | grep -i "latency\|duration\|took\|elapsed"

# Directly test model API latency
time curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 100,
    "messages": [{"role":"user","content":"hello"}]
  }' -o /dev/null -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirst Byte: %{time_starttransfer}s\nTotal: %{time_total}s\n"
```
Example output:
```
DNS:        0.012s
Connect:    0.145s
TLS:        0.298s
First Byte: 2.156s
Total:      2.890s
```
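To see where the time actually goes, it helps to split the report into connection setup (everything through the TLS handshake), server-side work (first byte minus TLS), and response transfer (total minus first byte). Here is a small illustrative Python helper (not part of OpenClaw) that assumes the exact `Label: <seconds>s` format produced by the `-w` template above:

```python
# Break a curl timing report into network vs. server components.
# Assumes the "Label: <seconds>s" lines from the -w template above.

def parse_timings(report: str) -> dict:
    """Parse lines like 'First Byte: 2.156s' into {label: seconds}."""
    timings = {}
    for line in report.strip().splitlines():
        label, _, value = line.partition(":")
        timings[label.strip()] = float(value.strip().rstrip("s"))
    return timings

def breakdown(t: dict) -> dict:
    """Attribute total latency to setup, server work, and transfer."""
    return {
        "connection_setup": t["TLS"],                       # DNS + TCP + TLS
        "server_side": t["First Byte"] - t["TLS"],          # queueing + inference start
        "response_transfer": t["Total"] - t["First Byte"],  # generation + download
    }

report = """DNS: 0.012s
Connect: 0.145s
TLS: 0.298s
First Byte: 2.156s
Total: 2.890s"""

print(breakdown(parse_timings(report)))
```

In the example output above, roughly 1.9s of the 2.9s total is spent server-side, which tells you to focus on model choice rather than networking.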
2. Impact of Model Selection on Speed
Different models have vastly different response speeds:
2.1 Model Latency Comparison
| Provider | Model | Avg. Time to First Token | Generation Speed (tokens/s) | Use Case |
|---|---|---|---|---|
| Anthropic | Claude Opus | 2-5s | 30-50 | Complex tasks |
| Anthropic | Claude Sonnet | 1-3s | 50-80 | Everyday conversation |
| Anthropic | Claude Haiku | 0.5-1.5s | 80-120 | Quick replies |
| OpenAI | GPT-4o | 1-3s | 50-80 | Everyday conversation |
| OpenAI | GPT-4o-mini | 0.5-1.5s | 80-120 | Quick replies |
| Groq | Llama 3 70B | 0.2-0.5s | 200-300 | Ultra-fast replies |
| Deepseek | Deepseek V3 | 1-3s | 40-60 | Cost-effective |
| Ollama | Llama 3 8B | 0.1-1s | 20-100* | Local deployment |
*Local model speed depends on hardware configuration
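The table's two columns combine into a rough end-to-end estimate: total reply time ≈ time to first token + output tokens ÷ generation speed. A tiny calculator, using illustrative mid-range numbers from the table:

```python
# Rough reply-time estimate from the table above:
#   total ≈ time-to-first-token + output_tokens / generation_speed
# Real latency also varies with prompt length and provider load.

def estimated_reply_seconds(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    return ttft_s + output_tokens / tokens_per_s

# A 300-token reply on a Sonnet-class model (TTFT ~2s, ~60 tok/s):
print(estimated_reply_seconds(2.0, 300, 60))   # → 7.0
# The same reply on a Groq-hosted model (TTFT ~0.3s, ~250 tok/s):
print(estimated_reply_seconds(0.3, 300, 250))  # ≈ 1.5
```

This is why model choice dominates: the same 300-token reply varies by 4-5x across providers before any other optimization.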
2.2 Choosing a Model Based on Use Case
```json5
// ~/.config/openclaw/openclaw.json5
{
  "models": {
    // Default model - balance between speed and quality
    "primary": {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514"
    },
    // Fast response model - for simple conversations
    "fast": {
      "provider": "groq",
      "model": "llama-3.1-70b-versatile"
    }
  }
}
```
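How you route between "primary" and "fast" is up to you. As a purely hypothetical sketch (this routing heuristic is not a built-in OpenClaw feature; the thresholds and keywords are assumptions), you could send short, simple messages to the fast model and everything else to the default:

```python
# Hypothetical pre-router: pick a configured model alias by message
# complexity. Thresholds and keyword list are illustrative only.

COMPLEX_HINTS = ("analyze", "summarize", "translate", "write", "code")

def pick_model(message: str) -> str:
    text = message.lower()
    if len(text) > 200 or any(hint in text for hint in COMPLEX_HINTS):
        return "primary"   # quality matters more than speed
    return "fast"          # low-latency model for small talk

print(pick_model("hi there"))                       # → fast
print(pick_model("please analyze this report..."))  # → primary
```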
3. Context Window Optimization
The longer the context, the longer the model takes to process it. Optimizing context size is a key lever for improving speed.
3.1 Limiting Context Length
```json5
{
  "conversation": {
    // Reduce the number of retained history messages
    "maxMessages": 30,
    // Limit maximum length per message
    "maxMessageLength": 4000,
    // Keep system prompt concise
    "systemPromptMaxLength": 2000
  }
}
```
3.2 Using a Summarization Strategy
Automatically generate summaries to replace history messages when conversations exceed a certain length:
```json5
{
  "conversation": {
    "contextStrategy": "summarize",
    "summarizeThreshold": 20,  // Trigger summarization after 20 messages
    "summarizeKeepRecent": 5   // Keep the 5 most recent messages in original form
  }
}
```
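The interaction between the two settings can be sketched as follows. This is a simplified illustration of the strategy, not OpenClaw's actual implementation, which would call the model itself to write the summary text:

```python
# Sketch of the summarize strategy: once history exceeds `threshold`,
# fold older messages into one summary placeholder and keep only the
# `keep_recent` newest messages verbatim.

def compact_history(messages: list, threshold: int = 20, keep_recent: int = 5) -> list:
    if len(messages) <= threshold:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system",
               "content": f"[summary of {len(older)} earlier messages]"}
    return [summary] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(25)]
print(len(compact_history(history)))  # → 6 (1 summary + 5 recent)
```

A 25-message history collapses to 6 entries, so the prompt the model must process stays short no matter how long the conversation runs.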
3.3 Controlling Skill Context
Each loaded skill increases the system prompt length:
```bash
# View loaded skills
openclaw skill list

# Disable unnecessary skills to reduce context:
# keep only frequently used skills in the ~/.openclaw/skills/ directory
```
4. Streaming Output Configuration
Enabling streaming output allows users to see the beginning of the reply sooner, dramatically improving perceived latency.
4.1 Enabling Streaming Output
```json5
// ~/.config/openclaw/openclaw.json5
{
  "streaming": {
    "enabled": true,
    // Buffer size (character count); content is sent once it reaches this threshold
    "bufferSize": 50,
    // Maximum buffer wait time (milliseconds)
    "flushInterval": 500
  }
}
```
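The two settings work together: a chunk is sent as soon as either the character threshold is reached or the wait interval expires, whichever comes first. A minimal sketch of that buffering logic (illustrative, not OpenClaw's actual code):

```python
import time

# Emit accumulated text once it reaches `buffer_size` characters, or
# once `flush_interval` seconds have passed since the last flush.

class StreamBuffer:
    def __init__(self, buffer_size=50, flush_interval=0.5, send=print):
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval
        self.send = send            # callback that delivers a chunk to the channel
        self.buf = ""
        self.last_flush = time.monotonic()

    def feed(self, token: str) -> None:
        self.buf += token
        elapsed = time.monotonic() - self.last_flush
        if len(self.buf) >= self.buffer_size or elapsed >= self.flush_interval:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.send(self.buf)
            self.buf = ""
        self.last_flush = time.monotonic()

chunks = []
sb = StreamBuffer(buffer_size=10, send=chunks.append)
for token in ["Hello ", "world, ", "streaming ", "demo."]:
    sb.feed(token)
sb.flush()  # deliver any trailing text
print(chunks)
```

A small `bufferSize` makes replies feel snappier but generates more channel messages or edits; tune it per channel's rate limits.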
4.2 Streaming Support by Channel
| Channel | Streaming Support | Implementation |
|---|---|---|
| Telegram | Partial | Message editing simulation |
| Discord | Partial | Message editing simulation |
| WhatsApp | Not supported | Waits for complete reply |
| Slack | Supported | Message updates |
| API | Fully supported | SSE event stream |
For channels that do not support streaming, you can enable chunked delivery:
```json5
{
  "streaming": {
    "fallbackMode": "chunked",
    // Maximum characters per chunk
    "chunkSize": 500
  }
}
```
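Chunked delivery simply splits the finished reply into pieces no longer than `chunkSize`, preferably at word boundaries so no word is cut in half. A sketch of that splitting logic (illustrative, not OpenClaw's implementation):

```python
# Split a finished reply into pieces of at most `chunk_size`
# characters, breaking on whitespace where possible.

def chunk_reply(text: str, chunk_size: int = 500) -> list[str]:
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)   # current chunk is full; start a new one
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

reply = "word " * 300  # a long reply of ~1500 characters
parts = chunk_reply(reply, 500)
print(len(parts), max(len(p) for p in parts))  # → 3 499
```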
5. Network Latency Optimization
5.1 Choosing a Closer API Endpoint
```json5
{
  "models": {
    "primary": {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514",
      // If you are in Asia, use the Asia-Pacific endpoint (if available)
      "baseUrl": "https://api.anthropic.com"
    }
  }
}
```
5.2 DNS Optimization
```bash
# Use faster DNS servers (Cloudflare DNS)
echo "nameserver 1.1.1.1" | sudo tee /etc/resolv.conf
echo "nameserver 1.0.0.1" | sudo tee -a /etc/resolv.conf

# Or use Google DNS
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Note: on systems where systemd-resolved or NetworkManager manages
# /etc/resolv.conf, configure DNS through those tools instead, or
# this file will be overwritten.
```
5.3 Proxy Optimization
If you access the API through a proxy, ensure the proxy latency is as low as possible:
```bash
# Test proxy latency
time curl -x http://127.0.0.1:7890 -s https://api.anthropic.com/v1/messages -o /dev/null

# Compare with direct connection latency
time curl -s https://api.anthropic.com/v1/messages -o /dev/null
```
If the proxy adds significant latency, consider switching proxy nodes or using a faster proxy protocol.
5.4 HTTP Connection Reuse
```json5
{
  "network": {
    // Enable HTTP Keep-Alive
    "keepAlive": true,
    // Maximum concurrent connections
    "maxSockets": 10,
    // Idle connection timeout (milliseconds)
    "keepAliveTimeout": 60000
  }
}
```
6. Local Model Optimization
If you use Ollama to run local models, hardware configuration directly determines inference speed.
6.1 GPU Acceleration
```bash
# Show running models and whether they are loaded on GPU or CPU
ollama ps

# NVIDIA GPUs require CUDA drivers
nvidia-smi

# Verify Ollama is using the GPU: check logs for "gpu" or "cuda"
journalctl -u ollama | grep -i "gpu\|cuda"
```
6.2 Model Quantization Options
Quantization can significantly reduce memory requirements and improve inference speed:
| Quantization Level | Model Size (7B) | Speed | Quality Loss |
|---|---|---|---|
| FP16 | ~14GB | Baseline | None |
| Q8_0 | ~7.5GB | 1.5x | Negligible |
| Q5_K_M | ~5GB | 2x | Minor |
| Q4_K_M | ~4GB | 2.5x | Moderate |
| Q3_K_M | ~3.5GB | 3x | Noticeable |
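The size column follows almost directly from arithmetic: model size ≈ parameter count × bits per weight ÷ 8, plus some file overhead for embeddings and metadata. The effective bits-per-weight figures below are rough averages (quantization formats store per-block scales, so e.g. Q8_0 costs a bit more than 8 bits per weight):

```python
# Approximate quantized model size from parameter count and the
# effective bits stored per weight. Real files are slightly larger.

def approx_size_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

print(round(approx_size_gb(7e9, 16), 1))   # FP16   → 14.0
print(round(approx_size_gb(7e9, 8.5), 1))  # Q8_0   → 7.4 (~8.5 bits/weight)
print(round(approx_size_gb(7e9, 4.5), 1))  # Q4_K_M → 3.9 (~4.5 bits/weight)
```

These estimates line up with the table, which is a quick way to check whether a given quantization will fit in your GPU's VRAM.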
```bash
# Choose the appropriate quantized version
ollama pull llama3:8b-q4_K_M
```

Then specify it in OpenClaw:

```json5
{
  "models": {
    "primary": {
      "provider": "ollama",
      "model": "llama3:8b-q4_K_M"
    }
  }
}
```
6.3 Ollama Performance Tuning
```bash
# Increase the GPU layer count (more layers computed on the GPU)
OLLAMA_NUM_GPU=999 ollama serve

# Set the parallel processing count
OLLAMA_NUM_PARALLEL=2 ollama serve

# Keep the model loaded in memory (avoids cold-start latency)
OLLAMA_KEEP_ALIVE=24h ollama serve
```
7. Concurrent Request Handling
When multiple users send messages simultaneously, serial processing causes excessive wait times for subsequent users.
7.1 Concurrency Configuration
```json5
{
  "performance": {
    // Maximum concurrent requests
    "maxConcurrentRequests": 5,
    // Request queue strategy
    "queueStrategy": "fifo",
    // Handling when the queue is full
    "queueOverflow": "reject_with_message",
    "queueOverflowMessage": "There are many requests at the moment. Please try again shortly."
  }
}
```
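The policy above — a concurrency cap, a FIFO wait, and rejection once the queue fills — can be sketched with an `asyncio.Semaphore`. This is an illustrative model of the behavior, not OpenClaw's actual scheduler:

```python
import asyncio

# At most `max_concurrent` requests run at once; arrivals beyond the
# queue limit get a rejection message instead of waiting indefinitely.

class RequestGate:
    def __init__(self, max_concurrent=2, max_queued=2):
        self.sem = asyncio.Semaphore(max_concurrent)
        self.max_queued = max_queued
        self.queued = 0

    async def handle(self, work):
        if self.queued >= self.max_queued and self.sem.locked():
            return "busy"              # queueOverflow: reject_with_message
        self.queued += 1
        async with self.sem:           # FIFO wait for a free slot
            self.queued -= 1
            return await work()

async def slow_reply():
    await asyncio.sleep(0.05)          # stand-in for a model API call
    return "ok"

async def main():
    gate = RequestGate(max_concurrent=2, max_queued=2)
    return await asyncio.gather(*(gate.handle(slow_reply) for _ in range(6)))

print(asyncio.run(main()))  # 4 requests served, 2 rejected
```

With 6 simultaneous arrivals, 2 run immediately, 2 queue, and 2 are rejected, which keeps worst-case wait times bounded.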
7.2 Multi-Model Load Balancing
```json5
{
  "models": {
    "loadBalancing": {
      "enabled": true,
      "strategy": "round-robin",
      "providers": [
        {
          "provider": "anthropic",
          "model": "claude-sonnet-4-20250514",
          "weight": 2
        },
        {
          "provider": "openai",
          "model": "gpt-4o",
          "weight": 1
        }
      ]
    }
  }
}
```
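One common way to implement weighted round-robin (shown here as an illustrative sketch, not necessarily how OpenClaw does it internally) is to expand each provider into `weight` slots and rotate through them, so a 2:1 weighting sends two of every three requests to the first provider:

```python
from itertools import cycle

# Weighted round-robin: expand providers into weight-proportional
# slots, then rotate through the slots for each incoming request.

providers = [
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514", "weight": 2},
    {"provider": "openai", "model": "gpt-4o", "weight": 1},
]

slots = [p for p in providers for _ in range(p["weight"])]
rotation = cycle(slots)

picks = [next(rotation)["provider"] for _ in range(6)]
print(picks)
# → ['anthropic', 'anthropic', 'openai', 'anthropic', 'anthropic', 'openai']
```

Besides spreading load, this also softens the impact of one provider's slow periods, since only a fraction of requests hit it.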
8. Performance Monitoring and Continuous Optimization
8.1 Response Latency Tracking
```bash
# Monitor average response time
openclaw logs | grep "response_time" | awk '{sum+=$NF; count++} END {print "Average latency:", sum/count, "ms"}'

# Find the slowest requests
openclaw logs | grep "response_time" | sort -t: -k2 -n -r | head -10
```
8.2 Optimization Checklist
- [ ] Choose the right model for the task (use fast models for lightweight tasks)
- [ ] Limit context length (maxMessages ≤ 30)
- [ ] Enable streaming output
- [ ] Optimize networking (DNS, proxy, connection reuse)
- [ ] Enable GPU acceleration for local models
- [ ] Configure reasonable concurrency limits
- [ ] Minimize the number of loaded skills
- [ ] Use a summarization strategy to compress conversation history
By working through these items systematically, your OpenClaw response speed will improve significantly. In most cases, choosing the right model and optimizing context length alone will resolve 80% of latency issues.