Access Required: To use this API endpoint, please contact us at contact@memories.ai to enable stream processing features for your account.
This endpoint provides a direct WebSocket proxy to ElevenLabs or AssemblyAI real-time speech-to-text services. Your client connects to our WebSocket gateway, and we transparently forward all frames to/from the upstream provider. You send audio, you receive transcription results — in real time.
When to use this vs /audio-stream/start?
  • Use this WebSocket endpoint when your client can send audio directly (e.g., browser microphone, mobile app).
  • Use /audio-stream/start when you have a stream URL (RTMP/RTSP/HLS) and want the server to pull and process it for you.
Pricing (varies by provider):
Provider     Rate         Per minute
AssemblyAI   $0.15/hour   $0.0025
ElevenLabs   $0.39/hour   $0.0065
Usage is billed in 5-second increments of audio streamed through the WebSocket connection.
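As a rough sketch of the arithmetic, the hourly rates above can be converted into a per-session estimate. This assumes partial increments are rounded up to the next 5-second boundary, which is an assumption — the docs only state that billing occurs in 5-second increments:

```python
import math

# Hourly rates from the pricing table above (USD).
RATES_PER_HOUR = {"assemblyai": 0.15, "elevenlabs": 0.39}

def estimate_cost(provider: str, audio_seconds: float) -> float:
    """Estimate the cost of a streaming session.

    Assumes partial 5-second increments round up (an assumption; the
    docs only state that billing occurs every 5 seconds).
    """
    rate_per_second = RATES_PER_HOUR[provider] / 3600
    billed_seconds = math.ceil(audio_seconds / 5) * 5
    return round(billed_seconds * rate_per_second, 6)
```

For example, `estimate_cost("elevenlabs", 60)` gives the $0.0065 per-minute figure from the table.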

Connection

wss://mavi-backend.memories.ai/serve/ws/v2/audio-stream
  ?provider=elevenlabs
  &api_key=sk-mai-xxx
  &model_id=scribe_v2_realtime
  &language_code=en
Parameter      Required   Description
provider       Yes        elevenlabs or assemblyai
api_key        Yes        Your API V2 Key (can also be sent via the Authorization header)
other params   No         All additional query parameters are forwarded to the upstream provider
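A minimal sketch of assembling the connection URL with the standard library; parameter names come from the table above, and the key value is a placeholder:

```python
from urllib.parse import urlencode

WS_BASE = "wss://mavi-backend.memories.ai/serve/ws/v2/audio-stream"

def build_ws_url(provider: str, api_key: str, **provider_params) -> str:
    """Build the gateway WebSocket URL.

    Extra keyword arguments are forwarded untouched to the upstream
    provider, per the "other params" row above.
    """
    params = {"provider": provider, "api_key": api_key, **provider_params}
    return f"{WS_BASE}?{urlencode(params)}"

url = build_ws_url("elevenlabs", "sk-mai-xxx",
                   model_id="scribe_v2_realtime", language_code="en")
```

Using `urlencode` also percent-escapes any characters that are unsafe in a query string.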

Authentication

Two options (pick one):
  1. Query parameter: ?api_key=sk-mai-xxx
  2. HTTP header: Authorization: sk-mai-xxx (sent during WebSocket handshake)
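For option 2, the header is supplied at handshake time by your WebSocket client. A sketch of building it (with the `websockets` library, the keyword for handshake headers is `additional_headers` in recent releases and `extra_headers` in older ones — verify against your installed version):

```python
def handshake_headers(api_key: str) -> dict:
    """Headers for auth option 2, sent during the WebSocket handshake.

    Example with the third-party websockets library (keyword name is
    version-dependent, see the note above):
        websockets.connect(url, additional_headers=handshake_headers(key))
    """
    return {"Authorization": api_key}
```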

Error Codes during Handshake

HTTP Status   Reason
401           Missing or invalid api_key
403           User not authorized for stream processing
400           Missing or invalid provider parameter

Provider-Specific Parameters

ElevenLabs

All parameters are passed as URL query strings and forwarded to wss://api.elevenlabs.io/v1/speech-to-text/realtime.
Parameter            Type      Default              Description
model_id             string    scribe_v2_realtime   Model to use for transcription
language_code        string    -                    ISO language code (e.g., en, zh)
tag_audio_events     boolean   false                Tag audio events such as music or laughter
num_speakers         integer   -                    Expected number of speakers
diarize              boolean   false                Enable speaker diarization
enable_logging       boolean   false                Enable server-side logging
inactivity_timeout   integer   -                    Session timeout in seconds when no audio is received
Audio format: Send audio as JSON text frames:
{
  "message_type": "input_audio_chunk",
  "audio_base_64": "<base64-encoded PCM audio>",
  "commit": false,
  "sample_rate": 16000
}
End session: Send a commit message:
{
  "message_type": "input_audio_chunk",
  "audio_base_64": "",
  "commit": true
}
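The two message shapes above can be wrapped in small helpers (a sketch; the field names are exactly those shown):

```python
import base64
import json

def audio_chunk_message(pcm: bytes, sample_rate: int = 16000) -> str:
    """JSON text frame carrying one chunk of base64-encoded PCM audio."""
    return json.dumps({
        "message_type": "input_audio_chunk",
        "audio_base_64": base64.b64encode(pcm).decode("ascii"),
        "commit": False,
        "sample_rate": sample_rate,
    })

def commit_message() -> str:
    """Final frame ending the session (empty audio, commit=true)."""
    return json.dumps({
        "message_type": "input_audio_chunk",
        "audio_base_64": "",
        "commit": True,
    })
```

Both return strings, so send them as text frames, not binary frames.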

AssemblyAI

All parameters are passed as URL query strings and forwarded to wss://streaming.assemblyai.com/v3/ws.
Parameter                          Type      Default     Description
sample_rate                        integer   16000       Audio sample rate in Hz
encoding                           string    pcm_s16le   Audio encoding format
language_code                      string    -           ISO language code
disable_partial_transcripts        boolean   false       Only receive final transcripts
enable_extra_session_information   boolean   false       Include extra session info
Audio format: Send raw PCM audio as binary frames.
End session: Send a text frame:
{
  "type": "session_termination"
}
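Because AssemblyAI takes raw binary frames rather than JSON, the send side reduces to slicing PCM into fixed-duration chunks plus one final text frame. A sketch (the 100 ms chunk size is a choice, not a requirement):

```python
import json

def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100):
    """Yield raw PCM s16le mono chunks of chunk_ms milliseconds each.

    Each chunk goes out as a binary WebSocket frame, unlike the JSON
    text frames ElevenLabs expects.
    """
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes/sample
    for i in range(0, len(pcm), bytes_per_chunk):
        yield pcm[i:i + bytes_per_chunk]

# Final text frame that ends the session.
TERMINATION_FRAME = json.dumps({"type": "session_termination"})
```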

Data Flow

Client (browser/app)                    Memories.ai Gateway                    Provider (ElevenLabs/AssemblyAI)
       |                                       |                                       |
       |── WebSocket connect ─────────────────>|                                       |
       |   ?provider=elevenlabs&api_key=xxx    |                                       |
       |                                       |── Validate API Key                    |
       |                                       |── Connect upstream ──────────────────>|
       |<── Connection established ────────────|<── Connection established ────────────|
       |                                       |                                       |
       |── Audio frame ──────────────────────->|── Forward audio ────────────────────->|
       |── Audio frame ──────────────────────->|── Forward audio ────────────────────->|
       |                                       |                                       |
       |<── Transcription result ──────────────|<── Transcription result ──────────────|
       |<── Transcription result ──────────────|<── Transcription result ──────────────|
       |                                       |                                       |
       |── Close ─────────────────────────────>|── Close ─────────────────────────────>|

Code Examples

import asyncio
import websockets
import json
import base64

WS_URL = "wss://mavi-backend.memories.ai/serve/ws/v2/audio-stream"
API_KEY = "sk-mai-your_api_key"

async def stream_audio():
    params = (
        f"?provider=elevenlabs"
        f"&api_key={API_KEY}"
        f"&model_id=scribe_v2_realtime"
        f"&language_code=en"
    )

    async with websockets.connect(WS_URL + params) as ws:
        # Start receiving task
        async def receive():
            async for message in ws:
                data = json.loads(message)
                print(json.dumps(data, indent=2))

        recv_task = asyncio.create_task(receive())

        # Send audio chunks (16kHz, 16-bit, mono PCM)
        with open("audio.pcm", "rb") as f:
            while chunk := f.read(3200):  # 100ms chunks
                msg = json.dumps({
                    "message_type": "input_audio_chunk",
                    "audio_base_64": base64.b64encode(chunk).decode(),
                    "commit": False,
                    "sample_rate": 16000
                })
                await ws.send(msg)
                await asyncio.sleep(0.1)

        # Send commit to finalize
        await ws.send(json.dumps({
            "message_type": "input_audio_chunk",
            "audio_base_64": "",
            "commit": True
        }))

        await recv_task

asyncio.run(stream_audio())

Response Messages

All transcription results from the upstream provider are forwarded to your client as-is (verbatim JSON). The message format depends on the selected provider.

ElevenLabs Response Example

{
  "message_type": "transcript",
  "language_code": "en",
  "language_probability": 0.98,
  "text": "Hello, how are you?",
  "words": [
    {
      "text": "Hello",
      "start": 0.0,
      "end": 0.5,
      "type": "word"
    }
  ]
}

AssemblyAI Response Example

{
  "type": "final_transcript",
  "text": "Hello, how are you?",
  "words": [
    {
      "text": "Hello",
      "start": 100,
      "end": 500,
      "confidence": 0.99
    }
  ],
  "created": "2025-01-01T00:00:00.000Z"
}

Important Notes

  • Per-provider billing: AssemblyAI is billed at $0.15/hour, ElevenLabs at $0.39/hour. Charges are calculated in 5-second increments of audio streamed.
  • Transparent proxy: All frames are forwarded as-is in both directions. The gateway does not modify any data.
  • Concurrent connections: Subject to your account’s stream processing limits.
  • Unsupported parameters are ignored: If you pass a parameter that doesn’t apply to the selected provider, the provider will silently ignore it.
  • Recommended audio format: 16kHz sample rate, 16-bit signed little-endian, mono channel (PCM s16le).