## Introduction
Modern instant messaging goes beyond text. Users may send images for AI to describe, voice messages for AI to transcribe, or documents for AI to analyze. As a multi-channel AI gateway, OpenClaw needs to handle multimedia messages from different platforms in a unified way, while each platform has different media capabilities. This article provides a detailed guide on OpenClaw's multimedia message processing mechanism and configuration methods.
## Supported Media Types
OpenClaw currently supports the following four categories of multimedia content:
| Media Type | Supported Formats | Processing Method |
|---|---|---|
| Images | JPG, PNG, GIF, WebP | Passed directly to vision-capable models |
| Audio | MP3, OGG, WAV, M4A | Transcribed via STT engine, then processed as text input |
| Documents | PDF, TXT, DOCX, CSV | Text content extracted and used as context input |
| Video | MP4, WebM | Key frames extracted and processed as an image sequence |
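The table above amounts to a mapping from file extension to processing pipeline. OpenClaw's internal routing isn't shown here; as a minimal sketch in Python (the mapping and function names are hypothetical):

```python
# Hypothetical mapping from file extension to the processing pipeline
# described in the table above.
MEDIA_PIPELINES = {
    "jpg": "vision", "jpeg": "vision", "png": "vision",
    "gif": "vision", "webp": "vision",
    "mp3": "stt", "ogg": "stt", "wav": "stt", "m4a": "stt",
    "pdf": "extract", "txt": "extract", "docx": "extract", "csv": "extract",
    "mp4": "keyframes", "webm": "keyframes",
}

def route_media(filename: str) -> str:
    """Return the pipeline name for a file, or 'unsupported'."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return MEDIA_PIPELINES.get(ext, "unsupported")
```
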
## Global Media Configuration

In `openclaw.json`, the global media processing configuration lives under the `advanced.media` section:
```json
{
  "advanced": {
    "media": {
      "enabled": true,
      "maxFileSize": "20mb",
      "tempDir": "~/.openclaw/temp/media",
      "cleanupInterval": "1h",
      "image": {
        "enabled": true,
        "maxResolution": "2048x2048",
        "autoResize": true,
        "compressionQuality": 85
      },
      "audio": {
        "enabled": true,
        "sttProvider": "whisper",
        "sttModel": "whisper-1",
        "maxDuration": 300
      },
      "document": {
        "enabled": true,
        "maxPages": 50,
        "ocrEnabled": false
      },
      "video": {
        "enabled": false,
        "maxDuration": 60,
        "keyframeInterval": 5
      }
    }
  }
}
```
### Key Parameter Descriptions
- `maxFileSize`: Maximum file size allowed for processing; files exceeding this limit are ignored and a notification is returned
- `autoResize`: Automatically scales images down when their resolution exceeds the model's limits
- `sttProvider`: Speech-to-text service provider; supports `whisper` (OpenAI) and `local` (local model)
- `maxDuration`: Maximum duration limit for audio and video, in seconds
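Size strings like `"20mb"` have to be turned into a byte count before the limit can be enforced. A small parser sketch, assuming the units OpenClaw accepts are the usual `b`/`kb`/`mb`/`gb` (the function names are hypothetical):

```python
import re

# Hypothetical parser for size strings such as the "20mb" used by maxFileSize.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_size(value: str) -> int:
    """Convert a size string like '20mb' into a byte count."""
    m = re.fullmatch(r"(\d+)\s*([kmg]?b)", value.strip().lower())
    if not m:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

def within_limit(file_bytes: int, max_file_size: str) -> bool:
    """True if a file of file_bytes fits under the configured maxFileSize."""
    return file_bytes <= parse_size(max_file_size)
```
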
## Image Message Processing Flow
When a user sends an image, OpenClaw processes it as follows:
- Receive: Download the image from the channel API to the temporary directory
- Pre-process: Check format and size; convert format and resize as needed
- Route: Check whether the current Agent's configured model supports vision input
- Send: Attach the image to the API request as Base64 or URL
- Clean up: Delete temporary files on the `cleanupInterval` schedule after processing
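The pre-process step above compares the image against the configured `maxResolution` and scales it down when `autoResize` is on. A sketch of that arithmetic (OpenClaw's actual resizing code isn't shown in this article, and these function names are hypothetical):

```python
def needs_resize(width: int, height: int, max_resolution: str) -> bool:
    """True if the image exceeds a 'WxH' limit such as '2048x2048'."""
    max_w, max_h = (int(v) for v in max_resolution.split("x"))
    return width > max_w or height > max_h

def resize_target(width: int, height: int, max_resolution: str) -> tuple[int, int]:
    """Scale down to fit inside max_resolution, preserving aspect ratio."""
    max_w, max_h = (int(v) for v in max_resolution.split("x"))
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return (round(width * scale), round(height * scale))
```
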
If the current model doesn't support vision capabilities, OpenClaw has two fallback strategies:
```json
{
  "advanced": {
    "media": {
      "image": {
        "fallbackOnNoVision": "describe",
        "visionFallbackModel": "claude"
      }
    }
  }
}
```
- `"describe"`: Automatically switch to a vision-capable model to describe the image content, then pass the description text to the original model
- `"reject"`: Directly inform the user that the current model doesn't support image input
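The decision between the two strategies is a simple branch on the configured value. As a sketch (the function and return labels are hypothetical; they stand in for OpenClaw's internal actions):

```python
def handle_image(model_supports_vision: bool, fallback: str) -> str:
    """Pick an action for an incoming image based on model capability and
    the configured fallbackOnNoVision strategy."""
    if model_supports_vision:
        return "attach-image"            # pass the image straight through
    if fallback == "describe":
        return "describe-with-fallback"  # visionFallbackModel describes it first
    if fallback == "reject":
        return "notify-unsupported"      # tell the user images aren't supported
    raise ValueError(f"unknown fallback strategy: {fallback!r}")
```
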
## Audio Message Processing
Audio messages are first processed through speech-to-text (STT), then passed to the model as text messages:
```
User sends voice → Download audio file → STT transcription → Text as user message → Send to model
```
The STT result is included in the session record's `metadata` field for reference:
```json
{
  "id": "msg_010",
  "role": "user",
  "content": "Book a meeting room for me at 3 PM tomorrow",
  "metadata": {
    "originalType": "audio",
    "sttConfidence": 0.95,
    "audioDuration": 4.2
  }
}
```
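Building such a record from an STT result is straightforward. A sketch following the field names in the sample above (the helper itself is hypothetical):

```python
def audio_to_message(transcript: str, confidence: float, duration: float) -> dict:
    """Wrap an STT transcript as a plain text user message, preserving the
    audio provenance in metadata as in the session record shown above."""
    return {
        "role": "user",
        "content": transcript,
        "metadata": {
            "originalType": "audio",
            "sttConfidence": confidence,
            "audioDuration": duration,
        },
    }
```
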
## Document Message Processing
Documents uploaded by users have their text content extracted and injected into the conversation as context:
```json
{
  "advanced": {
    "media": {
      "document": {
        "enabled": true,
        "maxPages": 50,
        "extractionMethod": "auto",
        "injectAs": "system"
      }
    }
  }
}
```
`injectAs` controls the role of the document content in the context:
- `"system"`: Injected as a system message; the model treats it as background knowledge
- `"user"`: Included as part of the user message, appended after the user's text
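The two modes differ only in where the extracted text lands in the message list. A sketch of that assembly, assuming a simple role/content message format (the function and message wording are hypothetical):

```python
def inject_document(history: list, doc_text: str, user_text: str,
                    inject_as: str) -> list:
    """Append the extracted document text to the conversation according to
    the configured injectAs mode."""
    if inject_as == "system":
        # Background knowledge: separate system message, then the user's text.
        return history + [
            {"role": "system", "content": f"Document content:\n{doc_text}"},
            {"role": "user", "content": user_text},
        ]
    if inject_as == "user":
        # Inline: document appended after the user's own text.
        return history + [
            {"role": "user",
             "content": f"{user_text}\n\n[Attached document]\n{doc_text}"},
        ]
    raise ValueError(f"unknown injectAs mode: {inject_as!r}")
```
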
## Media Support Differences Across Channels
Different instant messaging platforms have significantly different media message support. Here is how OpenClaw adapts:
| Channel | Image Receive | Image Send | Audio Receive | Document Receive | Notes |
|---|---|---|---|---|---|
| Telegram | Full support | Full support | Full support | Full support | Most comprehensive media support |
| WhatsApp | Full support | Full support | Full support | Full support | Requires Business API |
| Discord | Full support | Full support | Limited support | Full support | Voice messages need extra config |
| Slack | Full support | Full support | Not supported | Full support | Files via Files API |
| iMessage | Full support | Full support | Full support | Limited support | Relies on BlueBubbles bridge |
| Signal | Full support | Full support | Full support | Limited support | Via signal-cli |
| Web Dashboard | Full support | Full support | Not supported | Full support | Upload via browser |
| Matrix | Full support | Full support | Full support | Full support | Full Matrix media API |
## Channel-Level Media Overrides
If you want to disable specific media features on certain channels, you can override the global settings in the channel configuration:
```json
{
  "channels": {
    "telegram": {
      "enabled": true,
      "token": "your-token",
      "mediaSupport": {
        "image": true,
        "audio": true,
        "document": true,
        "video": false
      }
    },
    "slack": {
      "enabled": true,
      "mediaSupport": {
        "image": true,
        "audio": false,
        "document": true,
        "video": false
      }
    }
  }
}
```
## Model Media Capabilities
Not all models support multimedia input. OpenClaw maintains an internal model capability table:
| Model | Image Input | Audio Input | Document Input |
|---|---|---|---|
| Claude Sonnet/Opus | Supported | Not supported (needs STT) | Supports PDF |
| GPT-4o | Supported | Supported (native) | Not supported (needs extraction) |
| Gemini 2.0 Flash | Supported | Supported (native) | Supported |
| Ollama (LLaVA) | Supported | Not supported | Not supported |
OpenClaw automatically determines whether pre-processing is needed based on model capabilities. For example, audio sent to GPT-4o can be passed directly, while audio sent to Claude needs STT transcription first.
## Media File Cleanup
To prevent temporary files from growing indefinitely, OpenClaw performs periodic cleanup:
```bash
# Manually clean media cache
openclaw media cleanup

# View media cache usage
openclaw media stats
```
## Summary
OpenClaw's multimedia processing system fully adapts to the differentiated capabilities of each channel while providing a unified abstraction. Through proper configuration, you can enable your AI assistant to seamlessly handle images, audio, and documents across different platforms while controlling resource consumption. The key is to configure appropriate media processing parameters and fallback strategies based on the channels and models you actually use.