Introduction
Multimodal capabilities allow AI to go beyond text -- it can now "see" and understand images, documents, and even video. In OpenClaw, you can send images through WhatsApp, Telegram, and other channels, and the AI assistant will automatically recognize and analyze the visual content. This tutorial provides a thorough walkthrough of configuring and using multimodal vision models.
Models with Vision Capabilities
The major vision models currently supported by OpenClaw:
| Model | Provider | Image Understanding | PDF Understanding | Video Understanding | Cost |
|---|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | ★★★★★ | ★★★★★ | ✗ | Med-High |
| Claude Haiku 3.5 | Anthropic | ★★★★☆ | ★★★★☆ | ✗ | Low |
| GPT-4o | OpenAI | ★★★★★ | ★★★★☆ | ✗ | Medium |
| GPT-4o mini | OpenAI | ★★★★☆ | ★★★☆☆ | ✗ | Very Low |
| Gemini 2.5 Pro | Google | ★★★★★ | ★★★★★ | ★★★★☆ | Medium |
| Gemini 2.5 Flash | Google | ★★★★☆ | ★★★★☆ | ★★★☆☆ | Low |
| Llava 13B | Ollama (Local) | ★★★☆☆ | ✗ | ✗ | Free |
| Qwen 2.5 VL 72B | Ollama (Local) | ★★★★☆ | ★★★☆☆ | ✗ | Free |
Step 1: Basic Configuration
1.1 Using a Cloud Vision Model
Here is an example configuration of vision capabilities, using Claude Sonnet 4:
{
models: {
vision: {
provider: "anthropic",
apiKey: "${ANTHROPIC_API_KEY}",
defaultModel: "claude-sonnet-4",
capabilities: {
vision: true, // Enable image understanding
maxImageSize: 20, // Maximum image size (MB)
maxImagesPerMessage: 5, // Maximum images per message
},
parameters: {
temperature: 0.3, // Lower temperature recommended for vision tasks
maxTokens: 4096,
}
}
}
}
1.2 Using GPT-4o Vision
{
models: {
vision: {
provider: "openai",
apiKey: "${OPENAI_API_KEY}",
defaultModel: "gpt-4o",
capabilities: {
vision: true,
imageDetail: "high", // high or low; high is more accurate but costlier
}
}
}
}
1.3 Using Gemini Vision
{
models: {
vision: {
provider: "google",
apiKey: "${GOOGLE_AI_API_KEY}",
defaultModel: "gemini-2.5-flash",
capabilities: {
vision: true,
pdf: true, // Gemini natively supports PDF
video: true, // Supports video understanding
}
}
}
}
1.4 Using a Local Vision Model
# Download a vision-capable local model
ollama pull llava:13b
# Or the more powerful Qwen VL
ollama pull qwen2.5-vl:72b # Requires significant VRAM
Configure OpenClaw:
{
models: {
"local-vision": {
provider: "ollama",
baseUrl: "http://localhost:11434",
defaultModel: "llava:13b",
capabilities: {
vision: true,
}
}
}
}
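Before wiring the local model into OpenClaw, you can sanity-check it directly against the Ollama API, which accepts a prompt plus a list of base64-encoded images. A minimal sketch (the helper names are ours, not part of OpenClaw; assumes Ollama is listening on its default port):

```python
import base64
import json
import urllib.request

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an Ollama /api/generate payload with a base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask(image_path: str, prompt: str = "Describe this image.") -> str:
    """Send one image to a local llava:13b and return its text response."""
    with open(image_path, "rb") as f:
        payload = build_vision_request("llava:13b", prompt, f.read())
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

If this round-trip works but OpenClaw does not react to images, the problem is in the channel or model configuration rather than the model itself.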
Step 2: Channel Integration Configuration
2.1 Telegram Image Handling
Telegram supports sending images directly. Configuration:
{
channels: {
telegram: {
token: "${TELEGRAM_BOT_TOKEN}",
model: "vision", // Points to the vision model configured above
mediaHandling: {
images: true, // Receive and process images
documents: true, // Receive documents (PDFs, etc.)
maxFileSize: 20, // Maximum file size (MB)
compressImages: true, // Compress large images to save tokens
compressionQuality: 85, // Compression quality (0-100)
}
}
}
}
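The options above amount to a simple decision flow: reject files over the size cap, recompress large images to save tokens, and pass everything else through. A rough sketch of that logic (the function and return values here are illustrative, not OpenClaw's internal API):

```python
def plan_media(mime: str, size_bytes: int, cfg: dict) -> str:
    """Decide what to do with an incoming attachment: 'reject', 'compress', or 'pass'."""
    max_bytes = cfg.get("maxFileSize", 20) * 1024 * 1024
    if size_bytes > max_bytes:
        return "reject"
    is_image = mime.startswith("image/")
    if is_image and cfg.get("compressImages") and size_bytes > 1024 * 1024:
        # Re-encode at compressionQuality before handing the image to the model
        return "compress"
    return "pass"
```

With compressImages enabled, a 5 MB photo would be recompressed while a 200 KB screenshot passes through untouched.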
2.2 WhatsApp Image Handling
{
channels: {
whatsapp: {
model: "vision",
mediaHandling: {
images: true,
documents: true,
stickers: false, // Stickers usually don't need analysis
}
}
}
}
2.3 Discord Image Handling
{
channels: {
discord: {
token: "${DISCORD_BOT_TOKEN}",
model: "vision",
mediaHandling: {
images: true,
attachments: true, // Process message attachments
}
}
}
}
Step 3: Use Case Walkthrough
3.1 OCR Text Recognition
Send an image containing text in a chat, and the AI will automatically recognize the text content:
User: [sends a photo of a restaurant menu]
User: Please translate this menu
AI: I can see this is a Japanese restaurant menu. Here is the translation:
- 刺身盛り合わせ → Assorted Sashimi Platter
- 焼き鳥 → Grilled Chicken Skewers
- 天ぷら盛り合わせ → Assorted Tempura
...
Optimized system prompt for OCR:
{
models: {
vision: {
provider: "anthropic",
defaultModel: "claude-sonnet-4",
systemPrompt: "When the user sends an image, first carefully identify all text in the image, then process it according to the user's request. For blurry or uncertain text, mark it with [?].",
capabilities: {
vision: true,
}
}
}
}
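Under the hood, sending an image to Claude means pairing an image content block with the text request in a single user message, following Anthropic's Messages API format. A minimal sketch of that payload shape (the helper name is ours):

```python
import base64

def claude_image_message(image_bytes: bytes, media_type: str, question: str) -> dict:
    """Build a Messages-API user message pairing a base64 image with a text prompt."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,  # e.g. "image/jpeg" or "image/png"
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }
```

Placing the image block before the text block is the recommended ordering for OCR-style requests.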
3.2 Chart and Data Analysis
Send charts or table screenshots, and the AI can interpret the data:
User: [sends a sales data line chart]
User: Analyze this trend
AI: Based on the chart:
1. Overall trend: Sales steadily increased from Q1-Q3, with a dip in Q4
2. Peak occurred in September at approximately $450K
3. Year-over-year growth of about 23%
4. The Q4 decline may be related to seasonal factors
Recommend investigating the Q4 downturn and planning ahead for next year.
3.3 Document and Screenshot Understanding
Send UI screenshots, code screenshots, document photos, and more:
User: [sends an error screenshot]
User: How do I fix this error?
AI: The error message reads "TypeError: Cannot read property 'map' of undefined".
This means you're trying to call .map() on an undefined value.
Possible causes:
1. The API response format differs from expectations
2. Attempting to render before data has loaded
Suggestion: Add a null check before calling .map():
data?.items?.map(...) or (data.items || []).map(...)
3.4 Image Comparison
Send multiple images for comparative analysis:
User: [sends two interior design renderings]
User: Compare the pros and cons of these two options
AI: Option A (first image):
- Pros: Spacious feel, great lighting, modern minimalist style
- Cons: Limited storage space, cool color palette
Option B (second image):
- Pros: Ample storage, warm and cozy tones
- Cons: Space feels compact, furniture is densely arranged
Recommendation: Choose Option B for comfort, Option A for spaciousness.
Step 4: Image Processing Optimization
4.1 Controlling Resolution and Cost
High-resolution images consume more tokens (especially with OpenAI models). Proper configuration can save costs:
{
models: {
vision: {
provider: "openai",
defaultModel: "gpt-4o",
capabilities: {
vision: true,
imageDetail: "low", // low: fixed 85 tokens
// high: calculated by resolution, can be thousands of tokens
imageResize: {
maxWidth: 1024, // Automatically scale down large images
maxHeight: 1024,
}
}
}
}
}
4.2 Token Consumption Reference
| Setting | Image Size | Approx. Token Cost | Cost (GPT-4o) |
|---|---|---|---|
| low detail | Any | 85 tokens | ~$0.0002 |
| high detail | 512x512 | ~255 tokens | ~$0.0006 |
| high detail | 1024x1024 | ~765 tokens | ~$0.002 |
| high detail | 2048x1024 | ~1105 tokens | ~$0.003 |
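These numbers follow OpenAI's documented tiling rule for high detail: the image is first scaled to fit within 2048x2048, then scaled so its shortest side is at most 768px, and billed at 85 base tokens plus 170 per 512x512 tile. A sketch of that calculation:

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o input tokens for a single image."""
    if detail == "low":
        return 85  # flat rate regardless of resolution
    # 1) Scale down to fit within 2048x2048
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2) Scale down so the shortest side is at most 768px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) 85 base tokens + 170 per 512x512 tile
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 1024x1024 image is scaled to 768x768, which covers four tiles: 85 + 4 x 170 = 765 tokens. At GPT-4o's input price at the time of writing, that comes to roughly $0.002.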
4.3 Image Caching
For scenarios involving frequent analysis of similar images, enabling caching can reduce redundant processing:
{
models: {
vision: {
provider: "anthropic",
defaultModel: "claude-sonnet-4",
capabilities: {
vision: true,
},
cache: {
enabled: true,
imageHashCache: true, // Cache analysis results for identical images
ttl: 3600, // Cache validity period (seconds)
}
}
}
}
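Image-hash caching amounts to keying analysis results by a digest of the raw image bytes and expiring entries after the TTL. A minimal in-memory sketch of the idea (class and method names are illustrative, not OpenClaw internals):

```python
import hashlib
import time

class ImageAnalysisCache:
    """Cache vision-model results keyed by a SHA-256 digest of the image bytes."""

    def __init__(self, ttl: float = 3600):
        self.ttl = ttl
        self._store = {}  # digest -> (timestamp, result)

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        entry = self._store.get(key)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # cache hit: skip re-analyzing an identical image
        return None

    def put(self, image_bytes: bytes, result: str):
        key = hashlib.sha256(image_bytes).hexdigest()
        self._store[key] = (time.monotonic(), result)
```

Because the key is a digest of the bytes, only byte-identical images hit the cache; a recompressed or resized copy of the same photo will miss and be analyzed again.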
Step 5: Vision Model Comparison
Real-World Test Results
| Test Scenario | Claude Sonnet 4 | GPT-4o | Gemini 2.5 Pro | Llava 13B |
|---|---|---|---|---|
| Chinese OCR | Excellent | Excellent | Good | Fair |
| English OCR | Excellent | Excellent | Excellent | Good |
| Handwriting recognition | Good | Good | Good | Poor |
| Chart interpretation | Excellent | Excellent | Excellent | Fair |
| UI screenshot analysis | Excellent | Excellent | Good | Fair |
| Photo understanding | Excellent | Excellent | Excellent | Good |
| Multi-image comparison | Excellent | Good | Excellent | Poor |
| PDF parsing | Excellent | Good | Excellent | Not supported |
Recommendations
- Best quality: Claude Sonnet 4 (strongest all-around performance)
- Best value: Gemini 2.5 Flash (free quota + solid quality)
- Completely free: Llava 13B locally deployed (limited quality but functional)
- PDF processing: Gemini 2.5 Pro (best native PDF support)
Troubleshooting
Images Not Recognized
Confirm that vision: true is enabled in the configuration and check the logs:
openclaw logs | grep -i "vision\|image\|media"
Image Too Large, Causing Timeout
capabilities: {
vision: true,
imageResize: {
maxWidth: 1024,
maxHeight: 1024,
}
}
Poor Results with Local Vision Models
Local vision models like Llava have limited performance in complex scenarios. For tasks requiring high accuracy, use cloud-based models instead.
No Response After Sending a Video
Currently only the Gemini series supports video understanding. Other models will ignore or reject videos. You can configure a helpful response message:
mediaHandling: {
video: false,
unsupportedMediaReply: "Sorry, I don't currently support video analysis. Please send a screenshot or image instead.",
}
Summary
Multimodal vision capabilities greatly expand the range of use cases for OpenClaw. From everyday OCR to professional chart analysis, vision models make chatbots significantly more practical. We recommend Claude Sonnet 4 or GPT-4o for the best visual understanding, Gemini 2.5 Flash to experience vision features within the free quota, or Llava/Qwen VL for free local deployment. Choose the approach that best matches your needs and budget.