OpenClaw Speech-to-Text: Deepgram Transcription Service Configuration

The Challenge of Voice Message Processing

In chat platforms, users frequently send voice messages. However, large language models can typically only process text input and cannot directly understand audio content. This means that without speech-to-text preprocessing, the AI assistant would be unable to respond to voice messages.

OpenClaw solves this problem by integrating the Deepgram speech transcription service. Deepgram is an automatic speech recognition (ASR) platform known for high accuracy, low latency, and competitive pricing. When a user sends a voice message, OpenClaw automatically sends the audio to Deepgram for transcription, then passes the transcribed text to the language model for processing. The user does not need to take any extra steps.

Technical Advantages of Deepgram

There are several reasons for choosing Deepgram as the speech transcription engine:

High Accuracy: Deepgram uses end-to-end deep learning models that maintain high recognition accuracy across various accents and noisy environments.
Low Latency: Deepgram supports both real-time streaming transcription and fast batch transcription, and can process short audio clips within seconds.
Multi-language Support: Deepgram recognizes speech in dozens of languages, including Chinese, English, and Japanese.
Cost Effective: Compared to similar services, Deepgram offers competitive pricing.

Obtaining a Deepgram API Key

Visit the Deepgram website (deepgram.com) and register an account.
Create a new project in the console.
Generate a new key on the API Keys page in project settings.
Select the appropriate permission scope (at minimum, transcription permission is required).
Save your API key.

Deepgram gives new accounts a free usage allowance, which is sufficient for testing.

Configuring Deepgram

Using the Onboard Tool

openclaw onboard

During the guided flow, when the system asks whether to configure a speech transcription service, select "Yes," then choose Deepgram and enter your API key.

Manual Configuration

Add the Deepgram configuration in openclaw.json:

{
  "transcription": {
    "provider": "deepgram",
    "auth": {
      "key": "your-deepgram-api-key"
    }
  }
}

Note that Deepgram's configuration is placed differently from LLM providers. It goes in the transcription section of the config file rather than the providers section, because Deepgram provides a speech transcription service rather than a conversational model.
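As a quick sanity check before starting OpenClaw, the placement described above can be verified with a short script. This is a minimal sketch and not part of OpenClaw: the helper name check_transcription_config is hypothetical, and it assumes only the key layout shown in this article.

```python
import json


def check_transcription_config(config_text: str) -> list[str]:
    """Return a list of problems found in an openclaw.json-style document.

    Hypothetical helper: it only knows the layout shown in this article
    (a top-level "transcription" object with "provider" and "auth.key").
    """
    problems = []
    config = json.loads(config_text)

    section = config.get("transcription")
    if not isinstance(section, dict):
        # This is also the result when Deepgram was placed under "providers".
        return ['missing top-level "transcription" section']

    if section.get("provider") != "deepgram":
        problems.append('"transcription.provider" is not "deepgram"')

    auth = section.get("auth")
    key = auth.get("key") if isinstance(auth, dict) else None
    if not isinstance(key, str) or not key.strip():
        problems.append('"transcription.auth.key" is missing or empty')

    return problems
```

An empty list means the section is where this article says it should be. It does not confirm that the key itself is valid.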

Advanced Configuration Options

You can fine-tune the transcription behavior:

{
  "transcription": {
    "provider": "deepgram",
    "auth": {
      "key": "your-deepgram-api-key"
    },
    "model": "nova-2",
    "language": "zh",
    "punctuate": true,
    "smart_format": true
  }
}

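model: Select the Deepgram transcription model. nova-2 is a general-purpose model with broad language coverage. Deepgram has since released newer models such as nova-3, so check its documentation for the model that best supports your language.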
language: Specify the primary language of the audio. Set to "zh" to optimize for Chinese recognition.
punctuate: Automatically add punctuation marks.
smart_format: Enable smart formatting for automatic handling of numbers, dates, and other formats.
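These option names match query parameters of Deepgram's pre-recorded transcription endpoint (/v1/listen). The sketch below shows how such a config section could be turned into a request URL. How OpenClaw forwards the options internally is not described here, so treat the function build_listen_url and the FORWARDED_OPTIONS list as illustrative assumptions.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"

# Options from the "transcription" section that this sketch forwards as
# query parameters; "provider" and "auth" are deliberately excluded.
FORWARDED_OPTIONS = ("model", "language", "punctuate", "smart_format", "detect_language")


def build_listen_url(transcription: dict) -> str:
    """Build a Deepgram /v1/listen URL from a transcription config section."""
    params = {}
    for name in FORWARDED_OPTIONS:
        if name not in transcription:
            continue
        value = transcription[name]
        # Query strings use lowercase "true"/"false" rather than Python's "True".
        params[name] = str(value).lower() if isinstance(value, bool) else str(value)
    query = urlencode(params)
    return f"{DEEPGRAM_LISTEN_URL}?{query}" if query else DEEPGRAM_LISTEN_URL
```

For the advanced configuration above, this produces https://api.deepgram.com/v1/listen?model=nova-2&language=zh&punctuate=true&smart_format=true. The API key is not part of the URL; Deepgram expects it in an Authorization header of the form "Token <key>".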

Workflow Details

Once Deepgram is configured, the voice message processing flow works as follows:

The user sends a voice message on a chat platform (e.g., Discord, Telegram).
OpenClaw receives the voice message and detects that it is audio content.
OpenClaw sends the audio data to Deepgram's transcription API.
Deepgram returns the transcribed text.
OpenClaw passes the transcribed text as user input to the configured language model.
The language model generates a response, and OpenClaw sends the reply back to the chat platform.

The entire process typically completes within a few seconds. The user simply sends a voice message and receives a text reply from the AI.
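The flow above can be sketched as a small routing function. This is an illustration of the control flow only, not OpenClaw's implementation: the IncomingMessage type and the transcribe and ask_model callables are hypothetical stand-ins for the chat adapter, the Deepgram client, and the language model client.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IncomingMessage:
    """Hypothetical shape of a chat message; not OpenClaw's real type."""
    text: Optional[str] = None
    audio: Optional[bytes] = None


def handle_message(
    message: IncomingMessage,
    transcribe: Callable[[bytes], str],
    ask_model: Callable[[str], str],
) -> str:
    """Route a message through transcription (if needed) and then the model."""
    if message.audio is not None:
        # Audio detected: obtain text from the transcription service first.
        user_input = transcribe(message.audio)
    else:
        # Plain text skips transcription entirely.
        user_input = message.text or ""
    # The model only ever sees text; its reply goes back to the chat platform.
    return ask_model(user_input)
```

Passing the two services in as callables keeps the routing logic independent of any particular provider, which mirrors the point that Deepgram works with any configured language model.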

Multi-language Speech Recognition

If your user base speaks multiple languages, you can enable Deepgram's automatic language detection instead of fixing a single language:

{
  "transcription": {
    "provider": "deepgram",
    "auth": {
      "key": "your-deepgram-api-key"
    },
    "model": "nova-2",
    "detect_language": true
  }
}

With detect_language enabled, Deepgram will automatically identify the language in the audio and transcribe it, without needing to specify the language in advance.
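When language detection is requested, Deepgram's pre-recorded response reports the language it identified alongside the transcript. The sketch below reads both fields from a response that has already been parsed into a dictionary. The function name extract_transcript is hypothetical, and the field paths reflect Deepgram's documented response shape as understood here, so verify them against a real response before relying on them.

```python
from typing import Optional


def extract_transcript(response: dict) -> tuple[str, Optional[str]]:
    """Return (transcript, detected_language) from a Deepgram-style response."""
    channels = response.get("results", {}).get("channels", [])
    if not channels:
        return "", None
    channel = channels[0]
    alternatives = channel.get("alternatives", [])
    # The first alternative is the highest-ranked transcription.
    transcript = alternatives[0].get("transcript", "") if alternatives else ""
    # "detected_language" is only present when detection was requested.
    return transcript, channel.get("detected_language")
```

Logging the detected language next to each transcript makes it easier to spot clips where detection picked the wrong language.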

Working with Language Models

The Deepgram transcription service can work with any language model supported by OpenClaw. A typical complete configuration looks like this:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-5"
      }
    }
  },
  "transcription": {
    "provider": "deepgram",
    "auth": {
      "key": "your-deepgram-api-key"
    }
  }
}

With this configuration, text messages are processed directly by Claude, while voice messages are first transcribed by Deepgram and then passed to Claude.

Cost Considerations

Deepgram charges based on audio duration. Tips for controlling costs:

Set a reasonable maximum voice duration limit to avoid processing excessively long voice messages.
Monitor usage statistics in the Deepgram console and adjust quotas accordingly.
For low-frequency use cases, Deepgram's free tier may be sufficient.
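Because billing is tied to audio duration, both the length limit and a rough spend estimate reduce to simple arithmetic. The sketch below is illustrative only: should_transcribe and estimate_cost are hypothetical helpers, not OpenClaw settings, the 120-second default is arbitrary, and the per-minute price must be taken from Deepgram's current pricing page.

```python
def should_transcribe(duration_seconds: float, max_seconds: float = 120.0) -> bool:
    """Accept only clips within the limit (120 s is an arbitrary default)."""
    return 0 < duration_seconds <= max_seconds


def estimate_cost(durations_seconds: list[float], price_per_minute: float) -> float:
    """Estimate spend for a batch of clips, given a per-minute price you supply."""
    total_minutes = sum(durations_seconds) / 60.0
    return total_minutes * price_per_minute
```

Filtering a day's clip durations with should_transcribe and feeding the rest into estimate_cost gives a figure to compare against the usage shown in the Deepgram console.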

Verifying the Configuration

After configuration is complete, send a voice message on your chat platform to test. If everything is configured correctly, you should receive a text reply from the AI based on what you said. The OpenClaw logs show the transcription and model invocation steps in detail, which helps when a voice message gets no reply.