
OpenClaw Multimedia Message Processing Configuration


Introduction

Modern instant messaging goes beyond text. Users may send images for AI to describe, voice messages for AI to transcribe, or documents for AI to analyze. As a multi-channel AI gateway, OpenClaw needs to handle multimedia messages from different platforms in a unified way, while each platform has different media capabilities. This article provides a detailed guide on OpenClaw's multimedia message processing mechanism and configuration methods.

Supported Media Types

OpenClaw currently supports the following four categories of multimedia content:

| Media Type | Supported Formats | Processing Method |
|---|---|---|
| Images | JPG, PNG, GIF, WebP | Passed directly to vision-capable models |
| Audio | MP3, OGG, WAV, M4A | Transcribed via STT engine, then processed as text input |
| Documents | PDF, TXT, DOCX, CSV | Text content extracted and used as context input |
| Video | MP4, WebM | Key frames extracted and processed as an image sequence |

Global Media Configuration

In openclaw.json, the global media processing configuration is under the advanced.media section:

{
  "advanced": {
    "media": {
      "enabled": true,
      "maxFileSize": "20mb",
      "tempDir": "~/.openclaw/temp/media",
      "cleanupInterval": "1h",
      "image": {
        "enabled": true,
        "maxResolution": "2048x2048",
        "autoResize": true,
        "compressionQuality": 85
      },
      "audio": {
        "enabled": true,
        "sttProvider": "whisper",
        "sttModel": "whisper-1",
        "maxDuration": 300
      },
      "document": {
        "enabled": true,
        "maxPages": 50,
        "ocrEnabled": false
      },
      "video": {
        "enabled": false,
        "maxDuration": 60,
        "keyframeInterval": 5
      }
    }
  }
}

Key Parameter Descriptions

  • maxFileSize: Maximum file size allowed for processing; files exceeding this limit will be ignored and a notification returned
  • autoResize: Automatically scales images when their resolution exceeds the model's limits
  • sttProvider: Speech-to-text service provider; supports whisper (OpenAI) and local (local model)
  • maxDuration: Maximum duration limit for audio and video (in seconds)
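To make the maxFileSize check concrete, here is a minimal sketch of how a size string such as "20mb" might be parsed and enforced. The parse_size helper and its unit table are illustrative assumptions, not OpenClaw's actual implementation:

```python
# Hypothetical size-limit check for maxFileSize values like "20mb".
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

def parse_size(limit: str) -> int:
    """Convert a size string like '20mb' into a number of bytes."""
    limit = limit.strip().lower()
    # Try longer suffixes first so "mb" is not mistaken for a bare "b".
    for suffix, factor in sorted(UNITS.items(), key=lambda u: -len(u[0])):
        if limit.endswith(suffix):
            return int(float(limit[: -len(suffix)]) * factor)
    return int(limit)  # bare number of bytes

def within_limit(file_bytes: int, limit: str) -> bool:
    """True if the file fits under the configured maxFileSize."""
    return file_bytes <= parse_size(limit)

print(parse_size("20mb"))                  # 20971520
print(within_limit(5 * 1024 ** 2, "20mb")) # True
```

A file failing this check would be skipped and the user notified, matching the behavior described for maxFileSize above.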

Image Message Processing Flow

When a user sends an image, OpenClaw processes it as follows:

  1. Receive: Download the image from the channel API to the temporary directory
  2. Pre-process: Check format and size; convert format and resize as needed
  3. Route: Check whether the current Agent's configured model supports vision input
  4. Send: Attach the image to the API request as Base64 or URL
  5. Clean up: Delete temporary files on the cleanupInterval schedule after processing
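The pre-processing step (step 2) can be sketched as a pure resolution calculation: scale the image to fit maxResolution while preserving aspect ratio, and never upscale. The fit_resolution function name is hypothetical:

```python
# Illustrative sketch of autoResize: fit image dimensions within a
# maxResolution string like "2048x2048", preserving aspect ratio.
def fit_resolution(width: int, height: int, max_res: str) -> tuple[int, int]:
    max_w, max_h = (int(n) for n in max_res.split("x"))
    scale = min(max_w / width, max_h / height, 1.0)  # 1.0 cap: never upscale
    return round(width * scale), round(height * scale)

print(fit_resolution(4096, 3072, "2048x2048"))  # (2048, 1536)
print(fit_resolution(800, 600, "2048x2048"))    # (800, 600) - already fits
```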

If the current model doesn't support vision capabilities, OpenClaw has two fallback strategies:

{
  "advanced": {
    "media": {
      "image": {
        "fallbackOnNoVision": "describe",
        "visionFallbackModel": "claude"
      }
    }
  }
}
  • "describe": Automatically switch to a vision-capable model to describe the image content, then pass the description text to the original model
  • "reject": Directly inform the user that the current model doesn't support image input

Audio Message Processing

Audio messages are first processed through speech-to-text (STT), then passed to the model as text messages:

User sends voice → Download audio file → STT transcription → Text as user message → Send to model

The STT result is included in the session record's metadata field for reference:

{
  "id": "msg_010",
  "role": "user",
  "content": "Book a meeting room for me at 3 PM tomorrow",
  "metadata": {
    "originalType": "audio",
    "sttConfidence": 0.95,
    "audioDuration": 4.2
  }
}
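A transcribed voice message could be turned into the record above with a small builder like this; the field names follow the example, while the function itself is an illustrative assumption:

```python
# Hypothetical builder for a session record from an STT result.
def to_user_message(msg_id: str, transcript: str,
                    confidence: float, duration: float) -> dict:
    """Wrap an STT transcript as a text user message with audio metadata."""
    return {
        "id": msg_id,
        "role": "user",
        "content": transcript,
        "metadata": {
            "originalType": "audio",
            "sttConfidence": confidence,
            "audioDuration": duration,
        },
    }

msg = to_user_message("msg_010", "Book a meeting room for me at 3 PM tomorrow",
                      0.95, 4.2)
print(msg["metadata"]["originalType"])  # audio
```

Because the model only ever sees the content field, the metadata stays available for auditing without affecting the conversation.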

Document Message Processing

Documents uploaded by users have their text content extracted and injected into the conversation as context:

{
  "advanced": {
    "media": {
      "document": {
        "enabled": true,
        "maxPages": 50,
        "extractionMethod": "auto",
        "injectAs": "system"
      }
    }
  }
}

injectAs controls the role of the document content in the context:

  • "system": Injected as a system message; the model treats it as background knowledge
  • "user": Included as part of the user message, appended after the user's text

Media Support Differences Across Channels

Different instant messaging platforms have significantly different media message support. Here is how OpenClaw adapts:

| Channel | Image Receive | Image Send | Audio Receive | Document Receive | Notes |
|---|---|---|---|---|---|
| Telegram | Full support | Full support | Full support | Full support | Most comprehensive media support |
| WhatsApp | Full support | Full support | Full support | Full support | Requires Business API |
| Discord | Full support | Full support | Limited support | Full support | Voice messages need extra config |
| Slack | Full support | Full support | Not supported | Full support | Files via Files API |
| iMessage | Full support | Full support | Full support | Limited support | Relies on BlueBubbles bridge |
| Signal | Full support | Full support | Full support | Limited support | Via signal-cli |
| Web Dashboard | Full support | Full support | Not supported | Full support | Upload via browser |
| Matrix | Full support | Full support | Full support | Full support | Full Matrix media API |

Channel-Level Media Overrides

If you want to disable specific media features on certain channels, you can override the global settings in the channel configuration:

{
  "channels": {
    "telegram": {
      "enabled": true,
      "token": "your-token",
      "mediaSupport": {
        "image": true,
        "audio": true,
        "document": true,
        "video": false
      }
    },
    "slack": {
      "enabled": true,
      "mediaSupport": {
        "image": true,
        "audio": false,
        "document": true,
        "video": false
      }
    }
  }
}
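One plausible way to resolve these overrides is an AND-merge: a media type is enabled only when both the global switch and the channel's mediaSupport flag allow it, which matches the "disable per channel" intent above. This merge logic is an assumption, not OpenClaw's documented behavior:

```python
# Hypothetical merge of global media switches with channel-level overrides.
GLOBAL_MEDIA = {"image": True, "audio": True, "document": True, "video": False}

def effective_support(channel_overrides=None) -> dict:
    """Return the per-channel media flags after applying overrides."""
    merged = dict(GLOBAL_MEDIA)
    if channel_overrides:
        for media, allowed in channel_overrides.items():
            # AND semantics: a channel can disable a feature, never enable
            # one that is globally off.
            merged[media] = merged.get(media, False) and allowed
    return merged

slack = effective_support({"image": True, "audio": False,
                           "document": True, "video": False})
print(slack["audio"])  # False
```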

Model Media Capabilities

Not all models support multimedia input. OpenClaw maintains an internal model capability table:

| Model | Image Input | Audio Input | Document Input |
|---|---|---|---|
| Claude Sonnet/Opus | Supported | Not supported (needs STT) | Supports PDF |
| GPT-4o | Supported | Supported (native) | Not supported (needs extraction) |
| Gemini 2.0 Flash | Supported | Supported (native) | Supported |
| Ollama (LLaVA) | Supported | Not supported | Not supported |

OpenClaw automatically determines whether pre-processing is needed based on model capabilities. For example, audio sent to GPT-4o can be passed directly, while audio sent to Claude needs STT transcription first.
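That decision reduces to a lookup in a capability table like the one above. The table contents mirror the documented capabilities; the dictionary keys and the needs_stt helper are illustrative:

```python
# Hypothetical model capability lookup for the audio pre-processing decision.
CAPABILITIES = {
    "claude":           {"image": True, "audio": False, "document": True},
    "gpt-4o":           {"image": True, "audio": True,  "document": False},
    "gemini-2.0-flash": {"image": True, "audio": True,  "document": True},
    "ollama-llava":     {"image": True, "audio": False, "document": False},
}

def needs_stt(model: str) -> bool:
    """Audio must be transcribed first if the model lacks native audio input."""
    return not CAPABILITIES.get(model, {}).get("audio", False)

print(needs_stt("claude"))  # True  - transcribe before sending
print(needs_stt("gpt-4o"))  # False - pass audio through natively
```

Defaulting unknown models to "needs pre-processing" is the safe choice, since sending raw audio to a text-only model would simply fail.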

Media File Cleanup

To prevent temporary files from growing indefinitely, OpenClaw performs periodic cleanup:

# Manually clean media cache
openclaw media cleanup

# View media cache usage
openclaw media stats
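The cleanup pass itself amounts to deleting temp files older than cleanupInterval ("1h" = 3600 seconds). The sketch below demonstrates the idea against a throwaway directory rather than ~/.openclaw/temp/media; the cleanup function is an illustrative assumption:

```python
# Hypothetical periodic cleanup: remove temp files older than the interval.
import os
import tempfile
import time

def cleanup(temp_dir: str, max_age_seconds: int = 3600) -> int:
    """Delete files older than max_age_seconds; return how many were removed."""
    removed = 0
    now = time.time()
    for name in os.listdir(temp_dir):
        path = os.path.join(temp_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed += 1
    return removed

# Demo against a temporary directory.
with tempfile.TemporaryDirectory() as d:
    stale = os.path.join(d, "old.png")
    open(stale, "w").close()
    os.utime(stale, (time.time() - 7200,) * 2)  # backdate mtime by 2 hours
    open(os.path.join(d, "fresh.png"), "w").close()
    print(cleanup(d))  # 1 - only the backdated file is removed
```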

Summary

OpenClaw's multimedia processing system fully adapts to the differentiated capabilities of each channel while providing a unified abstraction. Through proper configuration, you can enable your AI assistant to seamlessly handle images, audio, and documents across different platforms while controlling resource consumption. The key is to configure appropriate media processing parameters and fallback strategies based on the channels and models you actually use.

OpenClaw is a free, open-source personal AI assistant that supports WhatsApp, Telegram, Discord, and many more platforms