When to use this vs /audio-stream/start?
- Use this WebSocket endpoint when your client can send audio directly (e.g., browser microphone, mobile app).
- Use /audio-stream/start when you have a stream URL (RTMP/RTSP/HLS) and want the server to pull and process it for you.
Pricing (varies by provider):
Billed every 5 seconds of audio streamed through the WebSocket connection.
| Provider | Rate | Per minute |
|---|---|---|
| AssemblyAI | $0.15/hour | $0.0025 |
| ElevenLabs | $0.39/hour | $0.0065 |
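As a rough illustration of how the rates above translate into a charge, the sketch below estimates the cost of a session from its audio duration. The function name `estimate_cost` is hypothetical, and rounding a partially used 5-second increment up to a full one is an assumption; the source only states that billing happens every 5 seconds.

```python
import math

# Hourly rates in USD, taken from the pricing table.
RATES_PER_HOUR = {"assemblyai": 0.15, "elevenlabs": 0.39}

# Charges are calculated every 5 seconds of streamed audio.
BILLING_INCREMENT_S = 5


def estimate_cost(provider: str, audio_seconds: float) -> float:
    """Estimate the charge in USD for `audio_seconds` of streamed audio.

    Assumption: a partially used 5-second increment is billed in full.
    """
    increments = math.ceil(audio_seconds / BILLING_INCREMENT_S)
    rate_per_increment = RATES_PER_HOUR[provider] * BILLING_INCREMENT_S / 3600
    return increments * rate_per_increment
```

Under these assumptions, 60 seconds of AssemblyAI audio spans 12 increments and comes to about $0.0025, which agrees with the per-minute column.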
Connection
| Parameter | Required | Description |
|---|---|---|
| provider | Yes | elevenlabs or assemblyai |
| api_key | Yes | Your API V2 Key (can also be sent via Authorization header) |
| other params | No | All additional query parameters are forwarded to the upstream provider |
Authentication
Two options (pick one):
- Query parameter: ?api_key=sk-mai-xxx
- HTTP header: Authorization: sk-mai-xxx (sent during WebSocket handshake)
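The two authentication options can be sketched as a small helper that builds the handshake URL and headers. This is only an illustration: `BASE_URL` is a placeholder (this section does not state the endpoint URL), `build_connection` is a hypothetical name, and encoding booleans as lowercase `true`/`false` is an assumption about what the upstream providers expect.

```python
from urllib.parse import urlencode

# Placeholder only: the real WebSocket endpoint URL is not given in this section.
BASE_URL = "wss://gateway.example.com/realtime"

SUPPORTED_PROVIDERS = ("elevenlabs", "assemblyai")


def build_connection(provider, api_key, use_header=False, **provider_params):
    """Return (url, headers) to use for the WebSocket handshake."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"provider must be one of {SUPPORTED_PROVIDERS}")

    query = {"provider": provider}
    headers = {}
    if use_header:
        # Option 2: send the key in the Authorization header.
        headers["Authorization"] = api_key
    else:
        # Option 1: send the key as the api_key query parameter.
        query["api_key"] = api_key

    # Remaining parameters are forwarded to the upstream provider.
    for name, value in provider_params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumed boolean encoding
        query[name] = value

    return f"{BASE_URL}?{urlencode(query)}", headers
```

The returned URL and headers can then be passed to whichever WebSocket client library you use. Keeping the key in the header avoids it appearing in URLs that may end up in logs.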
Error Codes during Handshake
| HTTP Status | Reason |
|---|---|
| 401 | Missing or invalid api_key |
| 403 | User not authorized for stream processing |
| 400 | Missing or invalid provider parameter |
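A client might translate these handshake statuses into readable errors along the following lines. This is a minimal sketch: `HandshakeError` and `raise_for_handshake_status` are hypothetical names, and how you obtain the HTTP status of a rejected handshake depends on your WebSocket library.

```python
# Reasons copied from the handshake error table.
HANDSHAKE_ERRORS = {
    400: "Missing or invalid provider parameter",
    401: "Missing or invalid api_key",
    403: "User not authorized for stream processing",
}


class HandshakeError(Exception):
    """Raised when the WebSocket handshake is rejected."""

    def __init__(self, status: int):
        self.status = status
        reason = HANDSHAKE_ERRORS.get(status, "Unexpected handshake status")
        super().__init__(f"{status}: {reason}")


def raise_for_handshake_status(status: int) -> None:
    """Raise HandshakeError for any status other than 101 Switching Protocols."""
    if status != 101:
        raise HandshakeError(status)
```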
Provider-Specific Parameters
ElevenLabs
All parameters are passed as URL query strings and forwarded to wss://api.elevenlabs.io/v1/speech-to-text/realtime.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_id | string | scribe_v2_realtime | Model to use for transcription |
| language_code | string | - | ISO language code (e.g., en, zh) |
| tag_audio_events | boolean | false | Tag audio events like music, laughter |
| num_speakers | integer | - | Expected number of speakers |
| diarize | boolean | false | Enable speaker diarization |
| enable_logging | boolean | false | Enable server-side logging |
| inactivity_timeout | integer | - | Session timeout in seconds when no audio is received |
AssemblyAI
All parameters are passed as URL query strings and forwarded to wss://streaming.assemblyai.com/v3/ws.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sample_rate | integer | 16000 | Audio sample rate in Hz |
| encoding | string | pcm_s16le | Audio encoding format |
| language_code | string | - | ISO language code |
| disable_partial_transcripts | boolean | false | Only receive final transcripts |
| enable_extra_session_information | boolean | false | Include extra session info |
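Because a parameter that does not apply to the selected provider is silently ignored, a typo or a mismatched parameter can go unnoticed. The sketch below checks parameter names on the client side before connecting. The sets mirror the two tables above, which may not list every parameter the upstream providers accept, and `unknown_params` is a hypothetical helper name.

```python
# Parameter names listed in the provider tables.
PROVIDER_PARAMS = {
    "elevenlabs": {
        "model_id", "language_code", "tag_audio_events", "num_speakers",
        "diarize", "enable_logging", "inactivity_timeout",
    },
    "assemblyai": {
        "sample_rate", "encoding", "language_code",
        "disable_partial_transcripts", "enable_extra_session_information",
    },
}


def unknown_params(provider: str, params: dict) -> list:
    """Return the parameter names not listed for `provider`, sorted."""
    return sorted(set(params) - PROVIDER_PARAMS[provider])
```

For example, passing `diarize` together with `provider=assemblyai` would be reported, since `diarize` appears only in the ElevenLabs table.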
Data Flow
Code Examples
Response Messages
All transcription results from the upstream provider are forwarded to your client as-is (verbatim JSON). The message format depends on the selected provider.
ElevenLabs Response Example
AssemblyAI Response Example
Important Notes
- Per-provider billing: AssemblyAI is billed at $0.15/hour and ElevenLabs at $0.39/hour. Charges are calculated every 5 seconds of audio streamed.
- Transparent proxy: All frames are forwarded as-is in both directions. The gateway does not modify any data.
- Concurrent connections: Subject to your account’s stream processing limits.
- Unsupported parameters are ignored: If you pass a parameter that doesn’t apply to the selected provider, the provider will silently ignore it.
- Recommended audio format: 16kHz sample rate, 16-bit signed little-endian, mono channel (PCM s16le).
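To make the recommended audio format concrete, the sketch below computes how many bytes a chunk of 16 kHz, 16-bit, mono PCM occupies and splits a buffer into chunks of that size. The 100 ms chunk duration and the helper names are assumptions for illustration. How each chunk must be framed on the wire depends on the selected provider's protocol, which is not covered here.

```python
SAMPLE_RATE_HZ = 16000   # 16 kHz
BYTES_PER_SAMPLE = 2     # 16-bit signed little-endian
CHANNELS = 1             # mono


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of bytes in `chunk_ms` milliseconds of PCM s16le mono audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_ms: int = 100):
    """Yield consecutive chunks of raw PCM; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At this format one second of audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes.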
