Introduction
Multimodal capabilities allow AI to go beyond text -- it can now "see" and understand images, documents, and even video. In OpenClaw, you can send images through WhatsApp, Telegram, and other channels, and the AI assistant will automatically recognize and analyze the visual content. This tutorial provides a thorough walkthrough of configuring and using multimodal vision models.
Models with Vision Capabilities
The major vision models currently supported by OpenClaw:
| Model | Provider | Image Understanding | PDF Understanding | Video Understanding | Cost |
|---|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | ★★★★★ | ★★★★★ | ✗ | Med-High |
| Claude Haiku 3.5 | Anthropic | ★★★★☆ | ★★★★☆ | ✗ | Low |
| GPT-4o | OpenAI | ★★★★★ | ★★★★☆ | ✗ | Medium |
| GPT-4o mini | OpenAI | ★★★★☆ | ★★★☆☆ | ✗ | Very Low |
| Gemini 2.5 Pro | Google | ★★★★★ | ★★★★★ | ★★★★☆ | Medium |
| Gemini 2.5 Flash | Google | ★★★★☆ | ★★★★☆ | ★★★☆☆ | Low |
| Llava 13B | Ollama (Local) | ★★★☆☆ | ✗ | ✗ | Free |
| Qwen 2.5 VL 72B | Ollama (Local) | ★★★★☆ | ★★★☆☆ | ✗ | Free |
Step 1: Basic Configuration
1.1 Using a Cloud Vision Model
Here is an example configuration of vision capabilities, using Claude Sonnet 4:
{
models: {
vision: {
provider: "anthropic",
apiKey: "${ANTHROPIC_API_KEY}",
defaultModel: "claude-sonnet-4",
capabilities: {
vision: true, // Enable image understanding
maxImageSize: 20, // Maximum image size (MB)
maxImagesPerMessage: 5, // Maximum images per message
},
parameters: {
temperature: 0.3, // Lower temperature recommended for vision tasks
maxTokens: 4096,
}
}
}
}
1.2 Using GPT-4o Vision
{
models: {
vision: {
provider: "openai",
apiKey: "${OPENAI_API_KEY}",
defaultModel: "gpt-4o",
capabilities: {
vision: true,
imageDetail: "high", // high or low; high is more accurate but costlier
}
}
}
}
1.3 Using Gemini Vision
{
models: {
vision: {
provider: "google",
apiKey: "${GOOGLE_AI_API_KEY}",
defaultModel: "gemini-2.5-flash",
capabilities: {
vision: true,
pdf: true, // Gemini natively supports PDF
video: true, // Supports video understanding
}
}
}
}
1.4 Using a Local Vision Model
# Download a vision-capable local model
ollama pull llava:13b
# Or the more powerful Qwen VL
ollama pull qwen2.5-vl:72b # Requires significant VRAM
Configure OpenClaw:
{
models: {
"local-vision": {
provider: "ollama",
baseUrl: "http://localhost:11434",
defaultModel: "llava:13b",
capabilities: {
vision: true,
}
}
}
}
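Before wiring the local model into OpenClaw, you can sanity-check it directly against the Ollama API, which accepts a prompt plus a list of base64-encoded images. A minimal sketch (the helper names are ours, not part of OpenClaw; assumes Ollama is listening on its default port):

```python
import base64
import json
import urllib.request

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an Ollama /api/generate payload with a base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask(image_path: str, prompt: str = "Describe this image.") -> str:
    """Send one image to a local llava:13b and return its text response."""
    with open(image_path, "rb") as f:
        payload = build_vision_request("llava:13b", prompt, f.read())
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

If this round-trip works but OpenClaw does not react to images, the problem is in the channel or model configuration rather than the model itself.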
Step 2: Channel Integration Configuration
2.1 Telegram Image Handling
Telegram supports sending images directly. Configuration:
{
channels: {
telegram: {
token: "${TELEGRAM_BOT_TOKEN}",
model: "vision", // Points to the vision model configured above
mediaHandling: {
images: true, // Receive and process images
documents: true, // Receive documents (PDFs, etc.)
maxFileSize: 20, // Maximum file size (MB)
compressImages: true, // Compress large images to save tokens
compressionQuality: 85, // Compression quality (0-100)
}
}
}
}
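The options above amount to a simple decision flow: reject files over the size cap, recompress large images to save tokens, and pass everything else through. A rough sketch of that logic (the function and return values here are illustrative, not OpenClaw's internal API):

```python
def plan_media(mime: str, size_bytes: int, cfg: dict) -> str:
    """Decide what to do with an incoming attachment: 'reject', 'compress', or 'pass'."""
    max_bytes = cfg.get("maxFileSize", 20) * 1024 * 1024
    if size_bytes > max_bytes:
        return "reject"
    is_image = mime.startswith("image/")
    if is_image and cfg.get("compressImages") and size_bytes > 1024 * 1024:
        # Re-encode at compressionQuality before handing the image to the model
        return "compress"
    return "pass"
```

With compressImages enabled, a 5 MB photo would be recompressed while a 200 KB screenshot passes through untouched.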
2.2 WhatsApp Image Handling
{
channels: {
whatsapp: {
model: "vision",
mediaHandling: {
images: true,
documents: true,
stickers: false, // Stickers usually don't need analysis
}
}
}
}
2.3 Discord Image Handling
{
channels: {
discord: {
token: "${DISCORD_BOT_TOKEN}",
model: "vision",
mediaHandling: {
images: true,
attachments: true, // Process message attachments
}
}
}
}
Step 3: Use Case Walkthrough
3.1 OCR Text Recognition
Send an image containing text in a chat, and the AI will automatically recognize the text content:
User: [sends a photo of a restaurant menu]
User: Please translate this menu
AI: I can see this is a Japanese restaurant menu. Here is the translation:
- 刺身盛り合わせ → Assorted Sashimi Platter
- 焼き鳥 → Grilled Chicken Skewers
- 天ぷら盛り合わせ → Assorted Tempura
...
Optimized system prompt for OCR:
{
models: {
vision: {
provider: "anthropic",
defaultModel: "claude-sonnet-4",
systemPrompt: "When the user sends an image, first carefully identify all text in the image, then process it according to the user's request. For blurry or uncertain text, mark it with [?].",
capabilities: {
vision: true,
}
}
}
}
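Under the hood, sending an image to Claude means pairing an image content block with the text request in a single user message, following Anthropic's Messages API format. A minimal sketch of that payload shape (the helper name is ours):

```python
import base64

def claude_image_message(image_bytes: bytes, media_type: str, question: str) -> dict:
    """Build a Messages-API user message pairing a base64 image with a text prompt."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,  # e.g. "image/jpeg" or "image/png"
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }
```

Placing the image block before the text block is the recommended ordering for OCR-style requests.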
3.2 Chart and Data Analysis
Send charts or table screenshots, and the AI can interpret the data:
User: [sends a sales data line chart]
User: Analyze this trend
AI: Based on the chart:
1. Overall trend: Sales steadily increased from Q1-Q3, with a dip in Q4
2. Peak occurred in September at approximately $450K
3. Year-over-year growth of about 23%
4. The Q4 decline may be related to seasonal factors
Recommend investigating the Q4 downturn and planning ahead for next year.
3.3 Document and Screenshot Understanding
Send UI screenshots, code screenshots, document photos, and more:
User: [sends an error screenshot]
User: How do I fix this error?
AI: The error message reads "TypeError: Cannot read property 'map' of undefined".
This means you're trying to call .map() on an undefined value.
Possible causes:
1. The API response format differs from expectations
2. Attempting to render before data has loaded
Suggestion: Add a null check before calling .map():
data?.items?.map(...) or (data.items || []).map(...)
3.4 Image Comparison
Send multiple images for comparative analysis:
User: [sends two interior design renderings]
User: Compare the pros and cons of these two options
AI: Option A (first image):
- Pros: Spacious feel, great lighting, modern minimalist style
- Cons: Limited storage space, cool color palette
Option B (second image):
- Pros: Ample storage, warm and cozy tones
- Cons: Space feels compact, furniture is densely arranged
Recommendation: Choose Option B for comfort, Option A for spaciousness.
Step 4: Image Processing Optimization
4.1 Controlling Resolution and Cost
High-resolution images consume more tokens (especially with OpenAI models). Proper configuration can save costs:
{
models: {
vision: {
provider: "openai",
defaultModel: "gpt-4o",
capabilities: {
vision: true,
imageDetail: "low", // low: fixed 85 tokens
// high: calculated by resolution, can be thousands of tokens
imageResize: {
maxWidth: 1024, // Automatically scale down large images
maxHeight: 1024,
}
}
}
}
}
4.2 Token Consumption Reference
| Setting | Image Size | Approx. Token Cost | Cost (GPT-4o) |
|---|---|---|---|
| low detail | Any | 85 tokens | ~$0.0002 |
| high detail | 512x512 | ~255 tokens | ~$0.0006 |
| high detail | 1024x1024 | ~765 tokens | ~$0.002 |
| high detail | 2048x1024 | ~1105 tokens | ~$0.003 |
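These numbers follow OpenAI's documented tiling rule for high detail: the image is first scaled to fit within 2048x2048, then scaled so its shortest side is at most 768px, and billed at 85 base tokens plus 170 per 512x512 tile. A sketch of that calculation:

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o input tokens for a single image."""
    if detail == "low":
        return 85  # flat rate regardless of resolution
    # 1) Scale down to fit within 2048x2048
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2) Scale down so the shortest side is at most 768px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) 85 base tokens + 170 per 512x512 tile
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 1024x1024 image is scaled to 768x768, which covers four tiles: 85 + 4 x 170 = 765 tokens. At GPT-4o's input price at the time of writing, that comes to roughly $0.002.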
4.3 Image Caching
For scenarios involving frequent analysis of similar images, enabling caching can reduce redundant processing:
{
models: {
vision: {
provider: "anthropic",
defaultModel: "claude-sonnet-4",
capabilities: {
vision: true,
},
cache: {
enabled: true,
imageHashCache: true, // Cache analysis results for identical images
ttl: 3600, // Cache validity period (seconds)
}
}
}
}
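Image-hash caching amounts to keying analysis results by a digest of the raw image bytes and expiring entries after the TTL. A minimal in-memory sketch of the idea (class and method names are illustrative, not OpenClaw internals):

```python
import hashlib
import time

class ImageAnalysisCache:
    """Cache vision-model results keyed by a SHA-256 digest of the image bytes."""

    def __init__(self, ttl: float = 3600):
        self.ttl = ttl
        self._store = {}  # digest -> (timestamp, result)

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        entry = self._store.get(key)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # cache hit: skip re-analyzing an identical image
        return None

    def put(self, image_bytes: bytes, result: str):
        key = hashlib.sha256(image_bytes).hexdigest()
        self._store[key] = (time.monotonic(), result)
```

Because the key is a digest of the bytes, only byte-identical images hit the cache; a recompressed or resized copy of the same photo will miss and be analyzed again.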
Step 5: Vision Model Comparison
Real-World Test Results
| Test Scenario | Claude Sonnet 4 | GPT-4o | Gemini 2.5 Pro | Llava 13B |
|---|---|---|---|---|
| Chinese OCR | Excellent | Excellent | Good | Fair |
| English OCR | Excellent | Excellent | Excellent | Good |
| Handwriting recognition | Good | Good | Good | Poor |
| Chart interpretation | Excellent | Excellent | Excellent | Fair |
| UI screenshot analysis | Excellent | Excellent | Good | Fair |
| Photo understanding | Excellent | Excellent | Excellent | Good |
| Multi-image comparison | Excellent | Good | Excellent | Poor |
| PDF parsing | Excellent | Good | Excellent | Not supported |
Recommendations
- Best quality: Claude Sonnet 4 (strongest all-around performance)
- Best value: Gemini 2.5 Flash (free quota + solid quality)
- Completely free: Llava 13B locally deployed (limited quality but functional)
- PDF processing: Gemini 2.5 Pro (best native PDF support)
Troubleshooting
Images Not Recognized
Confirm that vision: true is enabled in the configuration and check the logs:
openclaw logs | grep -i "vision\|image\|media"
Image Too Large, Causing Timeout
capabilities: {
vision: true,
imageResize: {
maxWidth: 1024,
maxHeight: 1024,
}
}
Poor Results with Local Vision Models
Local vision models like Llava have limited performance in complex scenarios. For tasks requiring high accuracy, use cloud-based models instead.
No Response After Sending a Video
Currently only the Gemini series supports video understanding. Other models will ignore or reject videos. You can configure a helpful response message:
mediaHandling: {
video: false,
unsupportedMediaReply: "Sorry, I don't currently support video analysis. Please send a screenshot or image instead.",
}
Summary
Multimodal vision capabilities greatly expand the range of use cases for OpenClaw. From everyday OCR to professional chart analysis, vision models make chatbots significantly more practical. We recommend Claude Sonnet 4 or GPT-4o for the best visual understanding, Gemini 2.5 Flash to experience vision features within the free quota, or Llava/Qwen VL for free local deployment. Choose the approach that best matches your needs and budget.