
OpenClaw Voice Synthesis and TTS Configuration


Introduction

Text is not the only form of AI interaction. OpenClaw ships with built-in Text-to-Speech (TTS) support, allowing your AI Agent to reply to users with voice messages. Whether it is sending voice notes on Telegram or reading replies aloud in a Discord voice channel, a few lines of configuration are all it takes.

This article covers TTS configuration methods, supported engines, and advanced usage in detail.

Enabling TTS

Basic Configuration

Enable TTS in openclaw.json5:

{
  agents: {
    "my-agent": {
      tts: {
        enabled: true,
        // TTS engine
        engine: "openai",
        // Default voice
        voice: "alloy",
        // Audio format
        format: "opus",  // opus / mp3 / aac / flac
        // Speech rate (0.5-2.0)
        speed: 1.0
      }
    }
  }
}

Supported TTS Engines

Engine     | Features                                    | Requirements
-----------|---------------------------------------------|-------------------------------
openai     | High quality, multilingual, low latency     | OpenAI API Key
azure      | Enterprise-grade, extensive voice selection | Azure Speech subscription
elevenlabs | Extremely natural, supports cloning         | ElevenLabs API Key
edge       | Free, decent quality                        | No additional configuration
local      | Offline operation, privacy-first            | piper or espeak installed

Configuration Examples for Each Engine

OpenAI TTS

{
  tts: {
    engine: "openai",
    voice: "nova",       // alloy, echo, fable, onyx, nova, shimmer
    model: "tts-1-hd",  // tts-1 (fast) or tts-1-hd (high quality)
    apiKey: "${OPENAI_API_KEY}"
  }
}

Azure TTS

{
  tts: {
    engine: "azure",
    voice: "zh-CN-XiaoxiaoNeural",  // Chinese female voice
    region: "eastasia",
    subscriptionKey: "${AZURE_SPEECH_KEY}",
    // Optional: SSML advanced control
    ssml: {
      rate: "medium",
      pitch: "default"
    }
  }
}

ElevenLabs TTS

{
  tts: {
    engine: "elevenlabs",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    apiKey: "${ELEVENLABS_API_KEY}",
    // Voice parameter tuning
    stability: 0.5,
    similarityBoost: 0.75
  }
}

Edge TTS (Free)

{
  tts: {
    engine: "edge",
    voice: "zh-CN-YunxiNeural",  // Chinese male voice
    // No API Key required
  }
}

Voice Reply Modes

OpenClaw supports three voice reply modes:

1. Voice-Only Mode

The AI sends only voice messages, without text:

{
  tts: {
    enabled: true,
    mode: "voice_only"
  }
}

2. Voice + Text Mode

Sends both voice and text messages simultaneously:

{
  tts: {
    enabled: true,
    mode: "voice_and_text"
  }
}

3. On-Demand Mode (Recommended)

Text is sent by default; users can request voice via a command:

{
  tts: {
    enabled: true,
    mode: "on_demand",
    // Keywords to trigger voice
    triggerWords: ["voice reply", "read it to me", "voice"]
  }
}

User interaction example:

User: voice What's the weather like today?
AI: [sends voice message] It's sunny in Beijing today, with temperatures between 15 and 25 degrees...
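The trigger check itself can be as simple as a case-insensitive substring match against the configured triggerWords. This is only a sketch of one plausible implementation; OpenClaw's actual matching rules are not documented here:

```python
def wants_voice(message: str, trigger_words: list[str]) -> bool:
    """Return True if the message contains any configured trigger word."""
    text = message.lower()
    return any(trigger.lower() in text for trigger in trigger_words)

# Trigger words taken from the on_demand config above.
TRIGGERS = ["voice reply", "read it to me", "voice"]
```

With this logic, the message "voice What's the weather like today?" would switch the reply to voice, while a plain question would not.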

Voice Command System

OpenClaw provides a set of /voice commands that let users flexibly control voice features within the chat:

/voice on          - Enable voice replies
/voice off         - Disable voice replies
/voice mode <mode> - Switch mode (voice_only / voice_and_text / on_demand)
/voice speed 1.5   - Adjust speech rate
/voice lang zh     - Set voice language
/voice list        - List available voices
/voice set <name>  - Switch voice character

Configuring Voice Commands

{
  agents: {
    "my-agent": {
      commands: {
        voice: {
          enabled: true,
          // Options users are allowed to adjust
          allowUserControl: {
            speed: true,
            voice: true,
            mode: true,
            language: true
          }
        }
      }
    }
  }
}

Speech-to-Text Input (STT)

OpenClaw supports not only voice output but also speech input recognition: when a user sends a voice message, OpenClaw can automatically transcribe it to text:

{
  agents: {
    "my-agent": {
      stt: {
        enabled: true,
        engine: "openai",  // Uses the Whisper model
        model: "whisper-1",
        language: "zh",    // Preferred recognition language
        // Whether to display transcription text after recognition
        showTranscription: true
      }
    }
  }
}

Complete voice conversation flow:

User -> [sends voice message] -> OpenClaw STT -> Transcribed to text -> AI processes
                                                                        |
User <- [receives voice reply] <- OpenClaw TTS <- Text reply <- AI generates
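The round trip above can be sketched as a pipeline of three stages. The functions below are placeholders standing in for the STT engine, the LLM, and the TTS engine, purely to show how the stages compose:

```python
def transcribe(audio: bytes) -> str:
    """Placeholder STT stage (e.g. Whisper would go here)."""
    return "What's the weather like today?"  # pretend transcription

def generate_reply(prompt: str) -> str:
    """Placeholder LLM stage."""
    return f"You asked: {prompt}"

def synthesize(text: str) -> bytes:
    """Placeholder TTS stage; returns encoded audio."""
    return text.encode("utf-8")  # stand-in for real audio bytes

def voice_round_trip(incoming_audio: bytes) -> bytes:
    """Voice in, voice out: STT -> AI -> TTS."""
    text_in = transcribe(incoming_audio)
    text_out = generate_reply(text_in)
    return synthesize(text_out)
```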

Platform Compatibility

Different chat platforms have varying levels of voice message support:

Platform  | Voice Sending | Voice Receiving (STT) | Max Duration | Format
----------|---------------|-----------------------|--------------|----------
Telegram  | Supported     | Supported             | 60 minutes   | OGG/Opus
Discord   | Supported     | Supported             | Unlimited    | Opus
WhatsApp  | Supported     | Supported             | Unlimited    | OGG/Opus
Slack     | Partial       | Supported             | Unlimited    | MP3
WeChat    | Supported     | Supported             | 60 seconds   | AMR/Silk
WebChat   | Supported     | Supported             | Unlimited    | MP3/WAV

OpenClaw automatically selects the optimal audio format and encoding parameters based on the target platform.
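Based on the compatibility table above, the format selection can be pictured as a simple per-platform lookup with a sensible default. This is an illustrative sketch, not OpenClaw's actual internals:

```python
# Preferred audio encoding per platform, taken from the table above.
PLATFORM_FORMATS = {
    "telegram": "ogg_opus",
    "discord":  "opus",
    "whatsapp": "ogg_opus",
    "slack":    "mp3",
    "wechat":   "silk",
    "webchat":  "mp3",
}

def pick_format(platform: str, default: str = "mp3") -> str:
    """Choose the audio format best supported by the target platform."""
    return PLATFORM_FORMATS.get(platform.lower(), default)
```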

WeChat Special Handling

WeChat voice messages have a 60-second limit. OpenClaw handles this automatically:

{
  tts: {
    platformOverrides: {
      wechat: {
        // Automatically split into multiple voice messages when exceeding 60 seconds
        autoSplit: true,
        // Or fall back to text for overly long content
        fallbackToText: true,
        maxDuration: 55  // 5-second buffer
      }
    }
  }
}
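One way the autoSplit behavior could be approximated: estimate spoken duration from character count and cut the text into chunks that each fit under maxDuration. The chars-per-second figure here is a rough assumption, since real speech rate varies with language and voice:

```python
def split_for_duration(text: str, max_duration: float = 55.0,
                       chars_per_second: float = 5.0) -> list[str]:
    """Split text into chunks whose estimated spoken length fits the limit.

    chars_per_second is a rough assumption; real speech rate depends on
    the language, voice, and configured speed.
    """
    max_chars = int(max_duration * chars_per_second)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

A production version would split on sentence boundaries rather than mid-word, but the budgeting logic is the same.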

Audio Caching

To reduce TTS API calls and lower latency, OpenClaw supports audio caching:

{
  tts: {
    cache: {
      enabled: true,
      // Cache directory
      dir: "./cache/tts",
      // Cache expiration time (seconds)
      ttl: 86400,  // 24 hours
      // Maximum cache size (MB)
      maxSize: 500
    }
  }
}

Caching strategy: same text + same voice configuration = cache hit, directly returning the previously generated audio file and skipping the TTS API call.
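The cache-hit condition can be implemented by hashing the text together with the voice-relevant settings, so any change in voice, speed, or format produces a new key. This is a sketch of one plausible scheme, not OpenClaw's actual implementation:

```python
import hashlib
import json

def tts_cache_key(text: str, voice_config: dict) -> str:
    """Derive a stable cache key from the text plus the TTS settings."""
    # sort_keys ensures semantically identical configs hash identically.
    payload = json.dumps({"text": text, "config": voice_config},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The hex digest then serves as the filename inside the cache directory, so repeated requests for the same text and voice read from disk instead of calling the TTS API.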

Advanced: SSML Markup

For scenarios that require precise control over voice effects, you can instruct the AI to use SSML markup in the system prompt:

{
  tts: {
    engine: "azure",
    ssmlEnabled: true
  }
}

The AI can embed SSML directives in its replies:

<speak>
  <prosody rate="slow" pitch="+2st">
    This passage will be read at a slower pace and higher pitch.
  </prosody>
  <break time="500ms"/>
  <emphasis level="strong">Key content</emphasis> will be emphasized.
</speak>

Cost Optimization

TTS services charge by character count. Here are some optimization tips:

  1. Enable caching: Avoid regenerating audio for identical content
  2. Use on-demand mode: Not every message needs voice
  3. Limit voice length: Automatically fall back to text for overly long replies
  4. Choose the right engine: Edge TTS is free but lower quality; OpenAI TTS costs money but offers higher quality

The length limit and fallback strategy can be configured directly:

{
  tts: {
    // Do not generate voice for replies exceeding 500 characters
    maxChars: 500,
    // Fallback strategy when limit is exceeded
    fallback: "text_with_summary_voice"
    // Send full text first, then a voice summary
  }
}
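Since billing is per character, a quick estimate helps decide where to set the maxChars cutoff. The per-million-character rate below is a placeholder for illustration; check your provider's current pricing:

```python
def estimate_tts_cost(num_chars: int,
                      usd_per_million_chars: float = 15.0) -> float:
    """Estimate TTS spend for a given character volume.

    usd_per_million_chars is an illustrative rate, not current pricing.
    """
    return num_chars / 1_000_000 * usd_per_million_chars
```

At that illustrative rate, capping replies at 500 characters means even 1,000 voice replies a day stay under a dollar.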

Summary

OpenClaw's voice features allow AI Agents to break beyond text-only interactions. With flexible TTS engine selection, multiple reply modes, the Voice command system, and speech input recognition, you can build AI assistants that truly support voice conversations. Combined with platform-specific handling and cache optimization, voice features can deliver a great experience while keeping costs under control.
