
OpenClaw Voice Synthesis and TTS Configuration


Introduction

Text is not the only form of AI interaction. OpenClaw ships with built-in Text-to-Speech (TTS) support, allowing your AI Agent to reply to users with voice messages. Whether it is sending voice notes on Telegram or reading replies aloud in a Discord voice channel, a few lines of configuration are all it takes.

This article covers TTS configuration methods, supported engines, and advanced usage in detail.

Enabling TTS

Basic Configuration

Enable TTS in openclaw.json5:

{
  agents: {
    "my-agent": {
      tts: {
        enabled: true,
        // TTS engine
        engine: "openai",
        // Default voice
        voice: "alloy",
        // Audio format
        format: "opus",  // opus / mp3 / aac / flac
        // Speech rate (0.5-2.0)
        speed: 1.0
      }
    }
  }
}

Supported TTS Engines

Engine     | Features                                    | Requirements
-----------|---------------------------------------------|-------------------------------
openai     | High quality, multilingual, low latency     | OpenAI API Key
azure      | Enterprise-grade, extensive voice selection | Azure Speech subscription
elevenlabs | Extremely natural, supports cloning         | ElevenLabs API Key
edge       | Free, decent quality                        | No additional configuration
local      | Offline operation, privacy-first            | piper or espeak installed

Configuration Examples for Each Engine

OpenAI TTS

{
  tts: {
    engine: "openai",
    voice: "nova",       // alloy, echo, fable, onyx, nova, shimmer
    model: "tts-1-hd",  // tts-1 (fast) or tts-1-hd (high quality)
    apiKey: "${OPENAI_API_KEY}"
  }
}

Azure TTS

{
  tts: {
    engine: "azure",
    voice: "zh-CN-XiaoxiaoNeural",  // Chinese female voice
    region: "eastasia",
    subscriptionKey: "${AZURE_SPEECH_KEY}",
    // Optional: SSML advanced control
    ssml: {
      rate: "medium",
      pitch: "default"
    }
  }
}

ElevenLabs TTS

{
  tts: {
    engine: "elevenlabs",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    apiKey: "${ELEVENLABS_API_KEY}",
    // Voice parameter tuning
    stability: 0.5,
    similarityBoost: 0.75
  }
}

Edge TTS (Free)

{
  tts: {
    engine: "edge",
    voice: "zh-CN-YunxiNeural",  // Chinese male voice
    // No API Key required
  }
}

Voice Reply Modes

OpenClaw supports three voice reply modes:

1. Voice-Only Mode

The AI sends only voice messages, without text:

{
  tts: {
    enabled: true,
    mode: "voice_only"
  }
}

2. Voice + Text Mode

Sends both voice and text messages simultaneously:

{
  tts: {
    enabled: true,
    mode: "voice_and_text"
  }
}

3. On-Demand Mode (Recommended)

Text is sent by default; users can request voice via a command:

{
  tts: {
    enabled: true,
    mode: "on_demand",
    // Keywords to trigger voice
    triggerWords: ["voice reply", "read it to me", "voice"]
  }
}

User interaction example:

User: voice What's the weather like today?
AI: [sends voice message] It's sunny in Beijing today, with temperatures between 15 and 25 degrees...
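The trigger check itself can be as simple as a case-insensitive substring match against the configured triggerWords. This is only a sketch of one plausible implementation; OpenClaw's actual matching rules are not documented here:

```python
def wants_voice(message: str, trigger_words: list[str]) -> bool:
    """Return True if the message contains any configured trigger word."""
    text = message.lower()
    return any(trigger.lower() in text for trigger in trigger_words)

# Trigger words taken from the on_demand config above.
TRIGGERS = ["voice reply", "read it to me", "voice"]
```

With this logic, the message "voice What's the weather like today?" would switch the reply to voice, while a plain question would not.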

Voice Command System

OpenClaw provides a set of /voice commands that let users flexibly control voice features within the chat:

/voice on          - Enable voice replies
/voice off         - Disable voice replies
/voice mode <mode> - Switch mode (voice_only / voice_and_text / on_demand)
/voice speed 1.5   - Adjust speech rate
/voice lang zh     - Set voice language
/voice list        - List available voices
/voice set <name>  - Switch voice character

Configuring Voice Commands

{
  agents: {
    "my-agent": {
      commands: {
        voice: {
          enabled: true,
          // Options users are allowed to adjust
          allowUserControl: {
            speed: true,
            voice: true,
            mode: true,
            language: true
          }
        }
      }
    }
  }
}

Speech-to-Text Input (STT)

OpenClaw supports not only voice output but also speech input recognition: when a user sends a voice message, OpenClaw can automatically transcribe it to text:

{
  agents: {
    "my-agent": {
      stt: {
        enabled: true,
        engine: "openai",  // Uses the Whisper model
        model: "whisper-1",
        language: "zh",    // Preferred recognition language
        // Whether to display transcription text after recognition
        showTranscription: true
      }
    }
  }
}

Complete voice conversation flow:

User -> [sends voice message] -> OpenClaw STT -> Transcribed to text -> AI processes
                                                                        |
User <- [receives voice reply] <- OpenClaw TTS <- Text reply <- AI generates
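The round trip above can be sketched as a pipeline of three stages. The functions below are placeholders standing in for the STT engine, the LLM, and the TTS engine, purely to show how the stages compose:

```python
def transcribe(audio: bytes) -> str:
    """Placeholder STT stage (e.g. Whisper would go here)."""
    return "What's the weather like today?"  # pretend transcription

def generate_reply(prompt: str) -> str:
    """Placeholder LLM stage."""
    return f"You asked: {prompt}"

def synthesize(text: str) -> bytes:
    """Placeholder TTS stage; returns encoded audio."""
    return text.encode("utf-8")  # stand-in for real audio bytes

def voice_round_trip(incoming_audio: bytes) -> bytes:
    """Voice in, voice out: STT -> AI -> TTS."""
    text_in = transcribe(incoming_audio)
    text_out = generate_reply(text_in)
    return synthesize(text_out)
```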

Platform Compatibility

Different chat platforms have varying levels of voice message support:

Platform  | Voice Sending | Voice Receiving (STT) | Max Duration | Format
----------|---------------|-----------------------|--------------|----------
Telegram  | Supported     | Supported             | 60 minutes   | OGG/Opus
Discord   | Supported     | Supported             | Unlimited    | Opus
WhatsApp  | Supported     | Supported             | Unlimited    | OGG/Opus
Slack     | Partial       | Supported             | Unlimited    | MP3
WeChat    | Supported     | Supported             | 60 seconds   | AMR/Silk
WebChat   | Supported     | Supported             | Unlimited    | MP3/WAV

OpenClaw automatically selects the optimal audio format and encoding parameters based on the target platform.
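Based on the compatibility table above, the format selection can be pictured as a simple per-platform lookup with a sensible default. This is an illustrative sketch, not OpenClaw's actual internals:

```python
# Preferred audio encoding per platform, taken from the table above.
PLATFORM_FORMATS = {
    "telegram": "ogg_opus",
    "discord":  "opus",
    "whatsapp": "ogg_opus",
    "slack":    "mp3",
    "wechat":   "silk",
    "webchat":  "mp3",
}

def pick_format(platform: str, default: str = "mp3") -> str:
    """Choose the audio format best supported by the target platform."""
    return PLATFORM_FORMATS.get(platform.lower(), default)
```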

WeChat Special Handling

WeChat voice messages have a 60-second limit. OpenClaw handles this automatically:

{
  tts: {
    platformOverrides: {
      wechat: {
        // Automatically split into multiple voice messages when exceeding 60 seconds
        autoSplit: true,
        // Or fall back to text for overly long content
        fallbackToText: true,
        maxDuration: 55  // 5-second buffer
      }
    }
  }
}
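One way the autoSplit behavior could be approximated: estimate spoken duration from character count and cut the text into chunks that each fit under maxDuration. The chars-per-second figure here is a rough assumption, since real speech rate varies with language and voice:

```python
def split_for_duration(text: str, max_duration: float = 55.0,
                       chars_per_second: float = 5.0) -> list[str]:
    """Split text into chunks whose estimated spoken length fits the limit.

    chars_per_second is a rough assumption; real speech rate depends on
    the language, voice, and configured speed.
    """
    max_chars = int(max_duration * chars_per_second)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

A production version would split on sentence boundaries rather than mid-word, but the budgeting logic is the same.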

Audio Caching

To reduce TTS API calls and lower latency, OpenClaw supports audio caching:

{
  tts: {
    cache: {
      enabled: true,
      // Cache directory
      dir: "./cache/tts",
      // Cache expiration time (seconds)
      ttl: 86400,  // 24 hours
      // Maximum cache size (MB)
      maxSize: 500
    }
  }
}

Caching strategy: same text + same voice configuration = cache hit, directly returning the previously generated audio file and skipping the TTS API call.
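The cache-hit condition can be implemented by hashing the text together with the voice-relevant settings, so any change in voice, speed, or format produces a new key. This is a sketch of one plausible scheme, not OpenClaw's actual implementation:

```python
import hashlib
import json

def tts_cache_key(text: str, voice_config: dict) -> str:
    """Derive a stable cache key from the text plus the TTS settings."""
    # sort_keys ensures semantically identical configs hash identically.
    payload = json.dumps({"text": text, "config": voice_config},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The hex digest then serves as the filename inside the cache directory, so repeated requests for the same text and voice read from disk instead of calling the TTS API.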

Advanced: SSML Markup

For scenarios that require precise control over voice effects, you can instruct the AI to use SSML markup in the system prompt:

{
  tts: {
    engine: "azure",
    ssmlEnabled: true
  }
}

The AI can embed SSML directives in its replies:

<speak>
  <prosody rate="slow" pitch="+2st">
    This passage will be read at a slower pace and higher pitch.
  </prosody>
  <break time="500ms"/>
  <emphasis level="strong">Key content</emphasis> will be emphasized.
</speak>

Cost Optimization

TTS services charge by character count. Here are some optimization tips:

  1. Enable caching: Avoid regenerating audio for identical content
  2. Use on-demand mode: Not every message needs voice
  3. Limit voice length: Automatically fall back to text for overly long replies
  4. Choose the right engine: Edge TTS is free but lower quality; OpenAI TTS costs money but offers higher quality

The length limit and fallback strategy can be configured directly:

{
  tts: {
    // Do not generate voice for replies exceeding 500 characters
    maxChars: 500,
    // Fallback strategy when limit is exceeded
    fallback: "text_with_summary_voice"
    // Send full text first, then a voice summary
  }
}
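Since billing is per character, a quick estimate helps decide where to set the maxChars cutoff. The per-million-character rate below is a placeholder for illustration; check your provider's current pricing:

```python
def estimate_tts_cost(num_chars: int,
                      usd_per_million_chars: float = 15.0) -> float:
    """Estimate TTS spend for a given character volume.

    usd_per_million_chars is an illustrative rate, not current pricing.
    """
    return num_chars / 1_000_000 * usd_per_million_chars
```

At that illustrative rate, capping replies at 500 characters means even 1,000 voice replies a day stay under a dollar.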

Summary

OpenClaw's voice features allow AI Agents to break beyond text-only interactions. With flexible TTS engine selection, multiple reply modes, the Voice command system, and speech input recognition, you can build AI assistants that truly support voice conversations. Combined with platform-specific handling and cache optimization, voice features can deliver a great experience while keeping costs under control.
