Introduction
Text is not the only form of AI interaction. OpenClaw has built-in Text-to-Speech (TTS) support, which lets your AI Agent reply to users with voice messages. Whether you want voice notes on Telegram or replies read aloud in a Discord voice channel, a few lines of configuration are enough.
This article covers TTS configuration methods, supported engines, and advanced usage in detail.
Enabling TTS
Basic Configuration
Enable TTS in openclaw.json5:
{
  agents: {
    "my-agent": {
      tts: {
        enabled: true,
        // TTS engine
        engine: "openai",
        // Default voice
        voice: "alloy",
        // Audio format
        format: "opus", // opus / mp3 / aac / flac
        // Speech rate (0.5-2.0)
        speed: 1.0
      }
    }
  }
}
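The value ranges noted in the comments above can be checked before a configuration is deployed. The following is a minimal Python sketch and not part of OpenClaw: `validate_tts_config` is a hypothetical helper, the allowed formats and the speed range are copied from the comments in the example, and the defaults used for missing keys are assumptions.

```python
# Hypothetical pre-flight check for a tts block; not an OpenClaw API.
ALLOWED_FORMATS = {"opus", "mp3", "aac", "flac"}  # from the "format" comment above
SPEED_RANGE = (0.5, 2.0)                          # from the "speed" comment above


def validate_tts_config(tts: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []

    fmt = tts.get("format", "opus")  # assumed default
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"unsupported format: {fmt!r}")

    speed = tts.get("speed", 1.0)  # assumed default
    low, high = SPEED_RANGE
    if not (low <= speed <= high):
        problems.append(f"speed {speed} is outside {low}-{high}")

    return problems
```

An empty result means both values fall inside the documented ranges; anything else can be reported to the operator before the agent starts.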
Supported TTS Engines
| Engine | Features | Requirements |
|---|---|---|
| openai | High quality, multilingual, low latency | Requires OpenAI API Key |
| azure | Enterprise-grade, extensive voice selection | Requires Azure Speech subscription |
| elevenlabs | Extremely natural, supports cloning | Requires ElevenLabs API Key |
| edge | Free, decent quality | No additional configuration needed |
| local | Offline operation, privacy-first | Requires piper or espeak installed |
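One way to act on this comparison is to pick an engine from whichever credentials are available and fall back to the free `edge` engine otherwise. The Python sketch below is an illustration only: `pick_engine` is a hypothetical helper, the environment variable names are the ones used in the per-engine examples in this article, and the preference order is an arbitrary assumption.

```python
# Hypothetical helper; the preference order is an assumption, not OpenClaw behavior.
ENGINE_KEYS = [
    ("elevenlabs", "ELEVENLABS_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("azure", "AZURE_SPEECH_KEY"),
]


def pick_engine(env: dict) -> str:
    """Return the first engine whose credential is set, else the free 'edge' engine."""
    for engine, key_name in ENGINE_KEYS:
        if env.get(key_name):  # missing or empty values count as "not configured"
            return engine
    return "edge"  # needs no additional configuration according to the table
```

In practice `env` would be `os.environ`; passing a plain dictionary keeps the function easy to test.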
Configuration Examples for Each Engine
OpenAI TTS
{
  tts: {
    engine: "openai",
    voice: "nova", // alloy, echo, fable, onyx, nova, shimmer
    model: "tts-1-hd", // tts-1 (fast) or tts-1-hd (high quality)
    apiKey: "${OPENAI_API_KEY}"
  }
}
Azure TTS
{
  tts: {
    engine: "azure",
    voice: "zh-CN-XiaoxiaoNeural", // Chinese female voice
    region: "eastasia",
    subscriptionKey: "${AZURE_SPEECH_KEY}",
    // Optional: SSML advanced control
    ssml: {
      rate: "medium",
      pitch: "default"
    }
  }
}
ElevenLabs TTS
{
  tts: {
    engine: "elevenlabs",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    apiKey: "${ELEVENLABS_API_KEY}",
    // Voice parameter tuning
    stability: 0.5,
    similarityBoost: 0.75
  }
}
Edge TTS (Free)
{
  tts: {
    engine: "edge",
    voice: "zh-CN-YunxiNeural", // Chinese male voice
    // No API Key required
  }
}
Voice Reply Modes
OpenClaw supports three voice reply modes:
1. Voice-Only Mode
The AI sends only voice messages, without text:
{
  tts: {
    enabled: true,
    mode: "voice_only"
  }
}
2. Voice + Text Mode
Sends both voice and text messages simultaneously:
{
  tts: {
    enabled: true,
    mode: "voice_and_text"
  }
}
3. On-Demand Mode (Recommended)
Text is sent by default; users can request a voice reply by including a trigger word in their message:
{
  tts: {
    enabled: true,
    mode: "on_demand",
    // Keywords to trigger voice
    triggerWords: ["voice reply", "read it to me", "voice"]
  }
}
User interaction example:
User: voice What's the weather like today?
AI: [sends voice message] It's sunny in Beijing today, with temperatures between 15 and 25 degrees...
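How trigger words are matched is not specified here, so the following Python sketch only illustrates one plausible rule: a case-insensitive match at the start of the message, with the trigger removed before the text reaches the model. `detect_voice_request` is a hypothetical function, and the matching rule is an assumption rather than OpenClaw's documented behavior.

```python
# Trigger words copied from the triggerWords example above.
TRIGGER_WORDS = ["voice reply", "read it to me", "voice"]


def detect_voice_request(message: str, trigger_words=TRIGGER_WORDS):
    """Return (wants_voice, remaining_text).

    Assumed rule: a trigger counts only as a whole word or phrase
    at the very start of the message, ignoring case.
    """
    text = message.strip()
    lowered = text.lower()
    # Try longer phrases first so "voice reply" is not shadowed by "voice".
    for word in sorted(trigger_words, key=len, reverse=True):
        if lowered.startswith(word.lower()):
            rest = text[len(word):]
            # Require a word boundary so "Voices ..." does not match "voice".
            if rest == "" or not rest[0].isalnum():
                return True, rest.lstrip(" :,")
    return False, text
```

With this rule, the message from the interaction example yields `True` together with the question itself, so the trigger word never reaches the model.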
Voice Command System
OpenClaw provides a set of /voice commands that let users flexibly control voice features within the chat:
/voice on - Enable voice replies
/voice off - Disable voice replies
/voice mode <mode> - Switch mode (voice_only / voice_and_text / on_demand)
/voice speed 1.5 - Adjust speech rate
/voice lang zh - Set voice language
/voice list - List available voices
/voice set <name> - Switch voice character
Configuring Voice Commands
{
  agents: {
    "my-agent": {
      commands: {
        voice: {
          enabled: true,
          // Options users are allowed to adjust
          allowUserControl: {
            speed: true,
            voice: true,
            mode: true,
            language: true
          }
        }
      }
    }
  }
}
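To make the relationship between the `/voice` commands and `allowUserControl` concrete, here is a small Python sketch of a command parser that rejects settings the user is not allowed to change. It is an illustration under assumptions: `parse_voice_command` is hypothetical, the mapping from subcommands to permission keys (`lang` to `language`, `set` to `voice`) is inferred from the names, and the error messages are invented.

```python
# Permission flags mirroring the allowUserControl example above.
ALLOW_USER_CONTROL = {"speed": True, "voice": True, "mode": True, "language": True}
MODES = {"voice_only", "voice_and_text", "on_demand"}
# Assumed mapping from subcommand to the setting (and permission key) it changes.
SETTING_FOR_ACTION = {"mode": "mode", "speed": "speed", "lang": "language", "set": "voice"}


def parse_voice_command(text: str, allow: dict = ALLOW_USER_CONTROL):
    """Parse a /voice command into a settings change, an error, or None."""
    parts = text.split()
    if len(parts) < 2 or parts[0] != "/voice":
        return None  # not a /voice command
    action, args = parts[1], parts[2:]

    if action in ("on", "off"):
        return {"enabled": action == "on"}
    if action == "list":
        return {"list": True}

    setting = SETTING_FOR_ACTION.get(action)
    if setting is None or len(args) != 1:
        return {"error": "unrecognized /voice command"}
    if not allow.get(setting, False):
        return {"error": f"changing {setting} is not allowed"}

    value = args[0]
    if setting == "speed":
        try:
            value = float(value)
        except ValueError:
            return {"error": "speed must be a number"}
    elif setting == "mode" and value not in MODES:
        return {"error": f"unknown mode: {value}"}
    return {setting: value}
```

Keeping the permission check in one place means that setting a flag to `false` in `allowUserControl` disables the matching subcommand without touching the parsing logic.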
Speech-to-Text Input (STT)
OpenClaw supports speech input as well as voice output. When a user sends a voice message, OpenClaw can automatically transcribe it to text:
{
  agents: {
    "my-agent": {
      stt: {
        enabled: true,
        engine: "openai", // Uses the Whisper model
        model: "whisper-1",
        language: "zh", // Preferred recognition language
        // Whether to display transcription text after recognition
        showTranscription: true
      }
    }
  }
}
Complete voice conversation flow:
User -> [sends voice message]  -> OpenClaw STT -> Transcribed to text -> AI processes
                                                                              |
User <- [receives voice reply] <- OpenClaw TTS <- Text reply <---------- AI generates
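The round trip in this flow can be written as three steps chained together. The Python sketch below shows only the shape of the pipeline: `handle_voice_message` and its three callables are hypothetical placeholders for the STT engine, the agent, and the TTS engine, and none of them are OpenClaw APIs.

```python
from typing import Callable


def handle_voice_message(
    audio: bytes,
    transcribe: Callable[[bytes], str],     # stands in for the STT engine
    generate_reply: Callable[[str], str],   # stands in for the AI agent
    synthesize: Callable[[str], bytes],     # stands in for the TTS engine
) -> dict:
    """Run one voice message through STT -> AI -> TTS and return every stage."""
    transcript = transcribe(audio)
    reply_text = generate_reply(transcript)
    reply_audio = synthesize(reply_text)
    return {
        "transcript": transcript,    # could be shown when showTranscription is true
        "reply_text": reply_text,
        "reply_audio": reply_audio,
    }
```

Passing the stages in as functions keeps the flow independent of any particular engine, which matches the way the STT and TTS engines are configured separately.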
Platform Compatibility
Different chat platforms have varying levels of voice message support:
| Platform | Voice Sending | Voice Receiving (STT) | Max Duration | Format |
|---|---|---|---|---|
| Telegram | Supported | Supported | 60 minutes | OGG/Opus |
| Discord | Supported | Supported | Unlimited | Opus |
|  | Supported | Supported | Unlimited | OGG/Opus |
| Slack | Partial | Supported | Unlimited | MP3 |
| WeChat | Supported | Supported | 60 seconds | AMR/Silk |
| WebChat | Supported | Supported | Unlimited | MP3/WAV |
OpenClaw automatically selects the optimal audio format and encoding parameters based on the target platform.
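The selection logic can be pictured as a lookup against the table above. The Python sketch below is a simplified illustration, not OpenClaw's implementation: `choose_format` is hypothetical, only the rows with a named platform are included, and the fallback to `mp3` for unknown platforms is an assumption.

```python
# Formats per platform, taken from the named rows of the compatibility table.
PLATFORM_FORMATS = {
    "telegram": ["ogg/opus"],
    "discord": ["opus"],
    "slack": ["mp3"],
    "webchat": ["mp3", "wav"],
}


def choose_format(platform: str, preferred: str = "opus", default: str = "mp3") -> str:
    """Pick the preferred format if the platform accepts it, else its first listed format."""
    supported = PLATFORM_FORMATS.get(platform.lower())
    if supported is None:
        return default  # assumed fallback for platforms not in the table
    if preferred in supported:
        return preferred
    return supported[0]
```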
WeChat Special Handling
WeChat voice messages have a 60-second limit. OpenClaw handles this automatically:
{
  tts: {
    platformOverrides: {
      wechat: {
        // Automatically split into multiple voice messages when exceeding 60 seconds
        autoSplit: true,
        // Or fall back to text for overly long content
        fallbackToText: true,
        maxDuration: 55 // 5-second buffer
      }
    }
  }
}
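Splitting has to happen on the text before synthesis, because the audio duration is not known in advance. The Python sketch below shows one way to do that: estimate duration from character count and cut at sentence boundaries. `split_for_duration` is hypothetical, and the speaking rate of 4 characters per second is an assumed placeholder that would need calibrating per voice and language.

```python
import re

# Sentence-like pieces: text up to and including closing punctuation,
# or a trailing piece with no punctuation at all.
_SENTENCE = re.compile(r"[^.!?。！？]*[.!?。！？]+\s*|[^.!?。！？]+$")


def split_for_duration(text: str, max_seconds: float = 55,
                       chars_per_second: float = 4.0) -> list[str]:
    """Split text into chunks whose estimated speech time stays under max_seconds.

    A single sentence longer than the budget is kept whole; this sketch
    does not split inside sentences.
    """
    budget = int(max_seconds * chars_per_second)  # max characters per chunk
    chunks, current = [], ""
    for sentence in _SENTENCE.findall(text):
        if current and len(current) + len(sentence) > budget:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each returned chunk would then be synthesized and sent as its own voice message; a caller could instead fall back to text when the list has more than a few entries.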
Audio Caching
To reduce TTS API calls and lower latency, OpenClaw supports audio caching:
{
  tts: {
    cache: {
      enabled: true,
      // Cache directory
      dir: "./cache/tts",
      // Cache expiration time (seconds)
      ttl: 86400, // 24 hours
      // Maximum cache size (MB)
      maxSize: 500
    }
  }
}
Caching strategy: a request with the same text and the same voice configuration is a cache hit. OpenClaw returns the previously generated audio file and skips the TTS API call.
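A cache like this is typically keyed by a hash of everything that affects the audio. The Python sketch below illustrates the idea; it is not OpenClaw's actual key scheme. `tts_cache_key` and `is_fresh` are hypothetical, and the choice of fields that make up the "voice configuration" (engine, voice, speed, format) is an assumption.

```python
import hashlib
import json


def tts_cache_key(text: str, engine: str, voice: str,
                  speed: float = 1.0, fmt: str = "opus") -> str:
    """Derive a stable key: identical text and voice settings give identical keys."""
    payload = json.dumps(
        {"text": text, "engine": engine, "voice": voice, "speed": speed, "format": fmt},
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,   # hash the real characters, e.g. Chinese text
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_fresh(created_at: float, now: float, ttl: int = 86400) -> bool:
    """True while a cache entry is younger than ttl seconds (86400 = 24 hours)."""
    return now - created_at < ttl
```

The key can double as the file name inside the cache directory, so a lookup is a single existence check followed by the freshness test.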
Advanced: SSML Markup
For scenarios that require precise control over voice effects, enable SSML and use the system prompt to instruct the AI to write its replies with SSML markup:
{
  tts: {
    engine: "azure",
    ssmlEnabled: true
  }
}
The AI can embed SSML directives in its replies:
<speak>
  <prosody rate="slow" pitch="+2st">
    This passage will be read at a slower pace and higher pitch.
  </prosody>
  <break time="500ms"/>
  <emphasis level="strong">Key content</emphasis> will be emphasized.
</speak>
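When a reply containing SSML also has to be shown as text, for example in voice + text mode, the markup needs to be removed first. Whether OpenClaw does this itself is not stated here, so the following Python sketch is only an assumption about how it could be handled; `ssml_to_text` is a hypothetical helper built on the standard library XML parser.

```python
import xml.etree.ElementTree as ET


def ssml_to_text(ssml: str) -> str:
    """Return the spoken text of an SSML document with all tags removed.

    Input that is not well-formed XML is returned unchanged, on the
    assumption that it is already plain text.
    """
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return ssml
    # itertext() walks every text node in document order; the split/join
    # collapses the line breaks and indentation between elements.
    return " ".join("".join(root.itertext()).split())
```

Applied to the SSML reply above, the function returns the two sentences as plain text, with the `<break>` element contributing nothing.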
Cost Optimization
Paid TTS services typically charge by character count. Here are some optimization tips:
- Enable caching: Avoid regenerating audio for identical content
- Use on-demand mode: Not every message needs voice
- Limit voice length: Automatically fall back to text for overly long replies
- Choose the right engine: Edge TTS is free but lower quality; OpenAI TTS costs money but offers higher quality
{
  tts: {
    // Do not generate voice for replies exceeding 500 characters
    maxChars: 500,
    // Fallback strategy when limit is exceeded
    fallback: "text_with_summary_voice"
    // Send full text first, then a voice summary
  }
}
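The decision behind `maxChars` and the `text_with_summary_voice` fallback can be sketched in a few lines of Python. This is an interpretation of the comments in the configuration above, not OpenClaw's code: `plan_reply` is hypothetical, and the summarizer is passed in as a function because how the summary is produced is not described here.

```python
from typing import Callable


def plan_reply(text: str, summarize: Callable[[str], str], max_chars: int = 500) -> dict:
    """Decide what to speak for a reply.

    Short replies are spoken in full. Replies over max_chars are sent as
    full text, and only a summary is passed to the TTS engine.
    """
    if len(text) <= max_chars:
        return {"send_full_text": False, "voice_source": text}
    return {"send_full_text": True, "voice_source": summarize(text)}
```

Because only `voice_source` is sent to the TTS engine, the number of billed characters for a long reply is bounded by the length of the summary rather than the length of the reply.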
Summary
OpenClaw's voice features let AI Agents move beyond text-only interaction. With a choice of TTS engines, three reply modes, the /voice command system, and speech input recognition, you can build AI assistants that hold full voice conversations. Platform-specific handling and audio caching keep the experience smooth while holding costs down.