## Introduction
Modern instant messaging goes beyond text. Users may send images for AI to describe, voice messages for AI to transcribe, or documents for AI to analyze. As a multi-channel AI gateway, OpenClaw needs to handle multimedia messages from different platforms in a unified way, while each platform has different media capabilities. This article provides a detailed guide on OpenClaw's multimedia message processing mechanism and configuration methods.
## Supported Media Types
OpenClaw currently supports the following four categories of multimedia content:
| Media Type | Supported Formats | Processing Method |
|---|---|---|
| Images | JPG, PNG, GIF, WebP | Passed directly to vision-capable models |
| Audio | MP3, OGG, WAV, M4A | Transcribed via STT engine, then processed as text input |
| Documents | PDF, TXT, DOCX, CSV | Text content extracted and used as context input |
| Video | MP4, WebM | Key frames extracted and processed as an image sequence |
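The table above amounts to a mapping from file extension to processing pipeline. OpenClaw's internal routing isn't shown here; as a minimal sketch in Python (the mapping and function names are hypothetical):

```python
# Hypothetical mapping from file extension to the processing pipeline
# described in the table above.
MEDIA_PIPELINES = {
    "jpg": "vision", "jpeg": "vision", "png": "vision",
    "gif": "vision", "webp": "vision",
    "mp3": "stt", "ogg": "stt", "wav": "stt", "m4a": "stt",
    "pdf": "extract", "txt": "extract", "docx": "extract", "csv": "extract",
    "mp4": "keyframes", "webm": "keyframes",
}

def route_media(filename: str) -> str:
    """Return the pipeline name for a file, or 'unsupported'."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return MEDIA_PIPELINES.get(ext, "unsupported")
```
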
## Global Media Configuration

In `openclaw.json`, the global media processing configuration lives under the `advanced.media` section:
```json
{
  "advanced": {
    "media": {
      "enabled": true,
      "maxFileSize": "20mb",
      "tempDir": "~/.openclaw/temp/media",
      "cleanupInterval": "1h",
      "image": {
        "enabled": true,
        "maxResolution": "2048x2048",
        "autoResize": true,
        "compressionQuality": 85
      },
      "audio": {
        "enabled": true,
        "sttProvider": "whisper",
        "sttModel": "whisper-1",
        "maxDuration": 300
      },
      "document": {
        "enabled": true,
        "maxPages": 50,
        "ocrEnabled": false
      },
      "video": {
        "enabled": false,
        "maxDuration": 60,
        "keyframeInterval": 5
      }
    }
  }
}
```
### Key Parameter Descriptions
- `maxFileSize`: Maximum file size allowed for processing; files exceeding this limit are ignored and a notification is returned
- `autoResize`: Automatically scales images down when their resolution exceeds the model's limits
- `sttProvider`: Speech-to-text service provider; supports `whisper` (OpenAI) and `local` (local model)
- `maxDuration`: Maximum duration limit for audio and video, in seconds
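Size strings like `"20mb"` have to be turned into a byte count before the limit can be enforced. A small parser sketch, assuming the units OpenClaw accepts are the usual `b`/`kb`/`mb`/`gb` (the function names are hypothetical):

```python
import re

# Hypothetical parser for size strings such as the "20mb" used by maxFileSize.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_size(value: str) -> int:
    """Convert a size string like '20mb' into a byte count."""
    m = re.fullmatch(r"(\d+)\s*([kmg]?b)", value.strip().lower())
    if not m:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

def within_limit(file_bytes: int, max_file_size: str) -> bool:
    """True if a file of file_bytes fits under the configured maxFileSize."""
    return file_bytes <= parse_size(max_file_size)
```
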
## Image Message Processing Flow
When a user sends an image, OpenClaw processes it as follows:
- Receive: Download the image from the channel API to the temporary directory
- Pre-process: Check format and size; convert format and resize as needed
- Route: Check whether the current Agent's configured model supports vision input
- Send: Attach the image to the API request as Base64 or URL
- Clean up: Delete temporary files on the `cleanupInterval` schedule after processing
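The pre-process step above compares the image against the configured `maxResolution` and scales it down when `autoResize` is on. A sketch of that arithmetic (OpenClaw's actual resizing code isn't shown in this article, and these function names are hypothetical):

```python
def needs_resize(width: int, height: int, max_resolution: str) -> bool:
    """True if the image exceeds a 'WxH' limit such as '2048x2048'."""
    max_w, max_h = (int(v) for v in max_resolution.split("x"))
    return width > max_w or height > max_h

def resize_target(width: int, height: int, max_resolution: str) -> tuple[int, int]:
    """Scale down to fit inside max_resolution, preserving aspect ratio."""
    max_w, max_h = (int(v) for v in max_resolution.split("x"))
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return (round(width * scale), round(height * scale))
```
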
If the current model doesn't support vision capabilities, OpenClaw has two fallback strategies:
```json
{
  "advanced": {
    "media": {
      "image": {
        "fallbackOnNoVision": "describe",
        "visionFallbackModel": "claude"
      }
    }
  }
}
```
- `"describe"`: Automatically switch to a vision-capable model to describe the image content, then pass the description text to the original model
- `"reject"`: Directly inform the user that the current model doesn't support image input
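The decision between the two strategies is a simple branch on the configured value. As a sketch (the function and return labels are hypothetical; they stand in for OpenClaw's internal actions):

```python
def handle_image(model_supports_vision: bool, fallback: str) -> str:
    """Pick an action for an incoming image based on model capability and
    the configured fallbackOnNoVision strategy."""
    if model_supports_vision:
        return "attach-image"            # pass the image straight through
    if fallback == "describe":
        return "describe-with-fallback"  # visionFallbackModel describes it first
    if fallback == "reject":
        return "notify-unsupported"      # tell the user images aren't supported
    raise ValueError(f"unknown fallback strategy: {fallback!r}")
```
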
## Audio Message Processing
Audio messages are first processed through speech-to-text (STT), then passed to the model as text messages:
```
User sends voice → Download audio file → STT transcription → Text as user message → Send to model
```
The STT result is included in the session record's `metadata` field for reference:
```json
{
  "id": "msg_010",
  "role": "user",
  "content": "Book a meeting room for me at 3 PM tomorrow",
  "metadata": {
    "originalType": "audio",
    "sttConfidence": 0.95,
    "audioDuration": 4.2
  }
}
```
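Building such a record from an STT result is straightforward. A sketch following the field names in the sample above (the helper itself is hypothetical):

```python
def audio_to_message(transcript: str, confidence: float, duration: float) -> dict:
    """Wrap an STT transcript as a plain text user message, preserving the
    audio provenance in metadata as in the session record shown above."""
    return {
        "role": "user",
        "content": transcript,
        "metadata": {
            "originalType": "audio",
            "sttConfidence": confidence,
            "audioDuration": duration,
        },
    }
```
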
## Document Message Processing
Documents uploaded by users have their text content extracted and injected into the conversation as context:
```json
{
  "advanced": {
    "media": {
      "document": {
        "enabled": true,
        "maxPages": 50,
        "extractionMethod": "auto",
        "injectAs": "system"
      }
    }
  }
}
```
`injectAs` controls the role of the document content in the context:
- `"system"`: Injected as a system message; the model treats it as background knowledge
- `"user"`: Included as part of the user message, appended after the user's text
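The two modes differ only in where the extracted text lands in the message list. A sketch of that assembly, assuming a simple role/content message format (the function and message wording are hypothetical):

```python
def inject_document(history: list, doc_text: str, user_text: str,
                    inject_as: str) -> list:
    """Append the extracted document text to the conversation according to
    the configured injectAs mode."""
    if inject_as == "system":
        # Background knowledge: separate system message, then the user's text.
        return history + [
            {"role": "system", "content": f"Document content:\n{doc_text}"},
            {"role": "user", "content": user_text},
        ]
    if inject_as == "user":
        # Inline: document appended after the user's own text.
        return history + [
            {"role": "user",
             "content": f"{user_text}\n\n[Attached document]\n{doc_text}"},
        ]
    raise ValueError(f"unknown injectAs mode: {inject_as!r}")
```
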
## Media Support Differences Across Channels
Different instant messaging platforms have significantly different media message support. Here is how OpenClaw adapts:
| Channel | Image Receive | Image Send | Audio Receive | Document Receive | Notes |
|---|---|---|---|---|---|
| Telegram | Full support | Full support | Full support | Full support | Most comprehensive media support |
| WhatsApp | Full support | Full support | Full support | Full support | Requires Business API |
| Discord | Full support | Full support | Limited support | Full support | Voice messages need extra config |
| Slack | Full support | Full support | Not supported | Full support | Files via Files API |
| iMessage | Full support | Full support | Full support | Limited support | Relies on BlueBubbles bridge |
| Signal | Full support | Full support | Full support | Limited support | Via signal-cli |
| Web Dashboard | Full support | Full support | Not supported | Full support | Upload via browser |
| Matrix | Full support | Full support | Full support | Full support | Full Matrix media API |
## Channel-Level Media Overrides
If you want to disable specific media features on certain channels, you can override the global settings in the channel configuration:
```json
{
  "channels": {
    "telegram": {
      "enabled": true,
      "token": "your-token",
      "mediaSupport": {
        "image": true,
        "audio": true,
        "document": true,
        "video": false
      }
    },
    "slack": {
      "enabled": true,
      "mediaSupport": {
        "image": true,
        "audio": false,
        "document": true,
        "video": false
      }
    }
  }
}
```
## Model Media Capabilities
Not all models support multimedia input. OpenClaw maintains an internal model capability table:
| Model | Image Input | Audio Input | Document Input |
|---|---|---|---|
| Claude Sonnet/Opus | Supported | Not supported (needs STT) | Supports PDF |
| GPT-4o | Supported | Supported (native) | Not supported (needs extraction) |
| Gemini 2.0 Flash | Supported | Supported (native) | Supported |
| Ollama (LLaVA) | Supported | Not supported | Not supported |
OpenClaw automatically determines whether pre-processing is needed based on model capabilities. For example, audio sent to GPT-4o can be passed directly, while audio sent to Claude needs STT transcription first.
## Media File Cleanup
To prevent temporary files from growing indefinitely, OpenClaw performs periodic cleanup:
```bash
# Manually clean media cache
openclaw media cleanup

# View media cache usage
openclaw media stats
```
## Summary
OpenClaw's multimedia processing system fully adapts to the differentiated capabilities of each channel while providing a unified abstraction. Through proper configuration, you can enable your AI assistant to seamlessly handle images, audio, and documents across different platforms while controlling resource consumption. The key is to configure appropriate media processing parameters and fallback strategies based on the channels and models you actually use.