POST /audio-stream/start
Start Audio Stream Transcription
curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/audio-stream/start \
  --header 'Authorization: <api-key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "audio_url": "rtmp://example.com/live/audio",
  "language_code": "en",
  "punctuate": false,
  "format_text": false,
  "language_detection": false,
  "language_confidence_threshold": 0,
  "audio_start_from": 0,
  "audio_end_at": 0,
  "multichannel": false,
  "speech_models": [
    "<string>"
  ],
  "speech_threshold": 0,
  "disfluencies": false,
  "speaker_labels": false,
  "speakers_expected": 0,
  "sentiment_analysis": false,
  "entity_detection": false,
  "auto_highlights": false,
  "content_safety": false,
  "iab_categories": false,
  "auto_chapters": false,
  "summarization": false,
  "summary_model": "<string>",
  "summary_type": "<string>",
  "custom_topics": false,
  "topics": [
    "<string>"
  ],
  "redact_pii": false,
  "redact_pii_sub": "<string>",
  "redact_pii_policies": [
    "<string>"
  ],
  "redact_pii_audio": false,
  "redact_pii_audio_quality": "<string>",
  "filter_profanity": false,
  "custom_spelling": [
    {}
  ],
  "speech_understanding": {}
}
'


Access Required: To use this API endpoint, please contact us at contact@memories.ai to enable stream processing features for your account.
This endpoint starts real-time audio stream transcription. The server pulls audio from the provided stream URL, decodes it to PCM via FFmpeg, and streams it to the selected provider (ElevenLabs or AssemblyAI) over WebSocket. Every message returned by the provider is forwarded verbatim to your webhook callback. Billing occurs every 5 seconds of audio streamed.
When to use this vs the WebSocket endpoint?
  • Use this HTTP endpoint when you have a stream URL (RTMP/RTSP/HLS) and want the server to handle audio decoding and streaming.
  • Use the WebSocket endpoint when your client can send audio directly (e.g., browser microphone).
Pricing (varies by provider):
Provider | Rate | Per 5s billing cycle | Per minute
AssemblyAI | $0.15/hour | $0.000208 | $0.0025
ElevenLabs | $0.39/hour | $0.000542 | $0.0065
Cost (USD) = duration × rate / 3600
Where:
  • duration: Audio duration in seconds
  • rate: 0.15 (AssemblyAI) or 0.39 (ElevenLabs), in USD per hour
  • Charges: pre-check at start, then billed every 5 seconds of audio streamed
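The formula above can be computed directly. As a sketch: rounding the duration up to the next 5-second billing cycle is an assumption inferred from the per-5-second billing described above, not a documented guarantee.

```python
import math

# USD per hour, from the pricing table above
RATES_PER_HOUR = {"assemblyai": 0.15, "elevenlabs": 0.39}

def estimate_cost(duration_seconds: float, provider: str) -> float:
    """Cost (USD) = duration x rate / 3600.

    Rounding up to a whole 5-second billing cycle is an assumption,
    inferred from the billing notes above.
    """
    billed_seconds = math.ceil(duration_seconds / 5) * 5
    return billed_seconds * RATES_PER_HOUR[provider] / 3600

# 10 minutes of AssemblyAI streaming
print(f"${estimate_cost(600, 'assemblyai'):.4f}")  # $0.0250
```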

Supported Protocols

  • RTMP (Recommended)
  • RTSP
  • HLS (.m3u8)
  • HTTP/HTTPS (direct audio URLs)

Key Features

  • Real-time transcription via ElevenLabs or AssemblyAI (controlled by provider parameter)
  • Server-side audio decoding (FFmpeg) — no client-side processing needed
  • Verbatim callback: every upstream message forwarded as-is to your webhook
  • Real-time billing every 5 seconds of audio
  • Auto-stop on insufficient balance (status 402)
  • All provider-specific parameters transparently forwarded

Architecture

                        Your Server
                            |
                     POST /audio-stream/start
                     { audio_url, provider, ... }
                            |
                            v
                   +------------------+
                   | Memories.ai      |
                   |                  |
audio_url -------> | FFmpeg (decode)  |
                   |     |            |
                   |     v            |
                   | PCM 16kHz mono   |
                   |     |            |
                   |     v            |
                   | WebSocket -------+-------> ElevenLabs / AssemblyAI
                   |                  |
                   |  <-- messages ---|<------- Provider responses
                   |     |            |
                   |     v            |
                   | Webhook callback |-------> Your callback URL
                   +------------------+

Code Example

import requests

BASE_URL = "https://mavi-backend.memories.ai/serve/api/v2"
API_KEY = "sk-mai-this_a_test_string_please_use_your_generated_key_during_testing"
HEADERS = {
    "Authorization": API_KEY,
    "Content-Type": "application/json"
}

def start_audio_stream(audio_url: str):
    url = f"{BASE_URL}/audio-stream/start"
    data = {
        "audio_url": audio_url,
        "provider": "elevenlabs",
        "language_code": "en",
        "model_id": "scribe_v2_realtime",
        "diarize": True,
        "num_speakers": 2
    }
    resp = requests.post(url, json=data, headers=HEADERS)
    resp.raise_for_status()  # fail fast on HTTP errors
    return resp.json()

result = start_audio_stream("rtmp://example.com/live/audio")
print(result)
print(f"Task ID: {result['data']['task_id']}")
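A stream started this way can later be stopped via the /audio-stream/stop endpoint referenced in the Status Codes section. Only the path is confirmed by this page; the request shape below (POST with a task_id body field) is an assumption for illustration.

```python
BASE_URL = "https://mavi-backend.memories.ai/serve/api/v2"
API_KEY = "sk-mai-this_a_test_string_please_use_your_generated_key_during_testing"
HEADERS = {"Authorization": API_KEY, "Content-Type": "application/json"}

def build_stop_request(task_id: str) -> tuple:
    """Assumed request shape: only the /audio-stream/stop path is confirmed
    by this page; the {"task_id": ...} body is a guess."""
    return f"{BASE_URL}/audio-stream/stop", {"task_id": task_id}

def stop_audio_stream(task_id: str) -> dict:
    import requests  # deferred so the request builder stays dependency-free
    url, body = build_stop_request(task_id)
    resp = requests.post(url, json=body, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()
```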

Response

Returns the task information for the started audio stream.
{
  "code": 200,
  "message": "success",
  "data": {
    "task_id": "660e8400-e29b-41d4-a716-446655440001",
    "message": "Audio stream transcription started"
  }
}

Request Parameters

Required

Parameter | Type | Description
audio_url | string | Audio stream URL (RTMP, RTSP, HLS, or HTTP)
provider | string | Transcription provider: elevenlabs or assemblyai

Common Parameters

Parameter | Type | Default | Description
language_code | string | - | Language code for transcription (e.g., en, zh, es, fr)

ElevenLabs Parameters

These parameters are forwarded when provider=elevenlabs.
Parameter | Type | Default | Description
model_id | string | scribe_v2_realtime | Model to use for transcription
tag_audio_events | boolean | false | Tag audio events (music, laughter, etc.)
num_speakers | integer | - | Expected number of speakers for diarization
diarize | boolean | false | Enable speaker diarization
enable_logging | boolean | false | Enable server-side logging
inactivity_timeout | integer | - | Session timeout (seconds) when no audio is received

AssemblyAI Parameters

These parameters are forwarded when provider=assemblyai.
Parameter | Type | Default | Description
punctuate | boolean | false | Add punctuation to the transcript
format_text | boolean | false | Format text in the transcript (numbers, dates)
language_detection | boolean | false | Enable automatic language detection
language_confidence_threshold | number | 0.0 | Confidence threshold for language detection (0.0-1.0)
audio_start_from | integer | 0 | Start transcription from this timestamp (ms)
audio_end_at | integer | 0 | End transcription at this timestamp (ms)
multichannel | boolean | false | Enable multi-channel audio processing
speech_models | array | - | Array of speech models to use
speech_threshold | number | 0.0 | Speech detection threshold (0.0-1.0)
disfluencies | boolean | false | Include disfluencies (um, uh) in transcript
speaker_labels | boolean | false | Enable speaker diarization
speakers_expected | integer | 0 | Expected number of speakers
sentiment_analysis | boolean | false | Enable sentiment analysis
entity_detection | boolean | false | Enable entity detection
auto_highlights | boolean | false | Enable automatic highlights extraction
content_safety | boolean | false | Enable content safety detection
iab_categories | boolean | false | Enable IAB category classification
auto_chapters | boolean | false | Enable automatic chapter generation
summarization | boolean | false | Enable automatic summarization
summary_model | string | - | Model for summarization (informative or conversational)
summary_type | string | - | Summary type (bullets or paragraph)
custom_topics | boolean | false | Enable custom topic detection
topics | array | - | Array of custom topics to detect
redact_pii | boolean | false | Redact personally identifiable information
redact_pii_sub | string | - | PII redaction substitution method
redact_pii_policies | array | - | Array of PII policies to apply
redact_pii_audio | boolean | false | Redact PII from audio
redact_pii_audio_quality | string | - | Quality for PII audio redaction
filter_profanity | boolean | false | Filter profanity from transcript
custom_spelling | array | - | Array of custom spelling corrections
speech_understanding | object | - | Speech understanding configuration
Any parameters not listed above can still be passed in the request body. They will be captured and forwarded to the upstream provider via the URL query string. This ensures forward compatibility with new provider features.
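For instance (a sketch; some_future_option is a hypothetical parameter name used only for illustration), an unlisted option simply rides along in the same request body:

```python
# Known and unknown fields share one request body; anything the server does
# not recognize is forwarded to the provider via the URL query string,
# per the forward-compatibility note above.
payload = {
    "audio_url": "rtmp://example.com/live/audio",
    "provider": "assemblyai",
    "punctuate": True,
    "some_future_option": "value",  # hypothetical, for illustration only
}
```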

Response Parameters

Parameter | Type | Description
code | string | Response code (200 indicates success)
message | string | Response message
data.task_id | string | Unique identifier of the transcription task
data.message | string | Status message about the stream start

Callback Response Parameters

Callbacks are sent continuously — one for each message received from the upstream provider.
Parameter | Type | Description
code | string | Response code (200 indicates callback delivery success)
message | string | Response message ("SUCCESS")
task_id | string | The task ID associated with this stream
data.status | integer | Status code (0 for normal message, see Status Codes below)
data.message | string | Status message (null for normal messages)
data.transcript | object | Verbatim JSON from the upstream provider (null for error/control statuses)
The data.transcript field contains the raw, unmodified response from the selected provider. The structure differs between ElevenLabs and AssemblyAI. Refer to each provider’s documentation for detailed field descriptions.

Status Codes

Status | Name | Description | Stream Continues
0 | Message | Normal transcription message from provider | Yes
-1 | Error | Processing or connection error | No
14 | User Stopped | User called /audio-stream/stop | No
16 | Capacity Reached | Server capacity limit reached | No
402 | Insufficient Balance | User balance insufficient | No
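A minimal dispatcher over these status codes might look like the sketch below; the transcript payload shown is illustrative only, since its structure is provider-specific.

```python
# Terminal statuses from the table above: once received, the stream has ended.
TERMINAL_STATUSES = {
    -1: "error",
    14: "user stopped",
    16: "capacity reached",
    402: "insufficient balance",
}

def handle_callback(body: dict) -> str:
    """Dispatch one webhook callback by data.status."""
    status = body["data"]["status"]
    if status == 0:
        # Normal message: data.transcript holds the provider's verbatim JSON.
        _transcript = body["data"]["transcript"]
        return "message"
    return TERMINAL_STATUSES.get(status, f"unknown status {status}")

callback = {
    "code": "200",
    "message": "SUCCESS",
    "task_id": "660e8400-e29b-41d4-a716-446655440001",
    "data": {"status": 0, "message": None, "transcript": {"text": "..."}},
}
print(handle_callback(callback))  # message
```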

Important Notes

  • provider is required: You must specify elevenlabs or assemblyai. Without it the request will fail.
  • Webhook required: Configure your webhook URL in user settings before using this API.
  • Verbatim callbacks: Each callback contains the exact JSON message from the provider — the server does not transform or aggregate the data.
  • Real-time billing: Billing occurs every 5 seconds of audio data streamed to the provider. Auto-stops when balance is insufficient.
  • Pre-charge: Balance is checked at start (one 5-second unit). If insufficient, the task is not started (status 402).
  • Parameter transparency: Parameters specific to a provider are forwarded as-is. Parameters not relevant to the selected provider are silently ignored by the provider.

Supported Languages

Common language codes (supported by both providers):
  • en - English
  • zh - Chinese
  • es - Spanish
  • fr - French
  • de - German
  • ja - Japanese
  • ko - Korean
  • And many more…

Rate Limiting

  • Maximum concurrent streams: Each user can run N concurrent stream tasks (video + audio combined)
  • Capacity check: Returns status 16 if server capacity is reached
  • Balance check: Returns status 402 if insufficient balance at start

Authorizations

Authorization (string, header, required)

Body

application/json

audio_url (string, required): Audio stream URL (RTMP, RTSP, HLS, or HTTP). Example: "rtmp://example.com/live/audio"
provider (string, required): Transcription provider, elevenlabs or assemblyai
language_code (string): Language code for transcription. Example: "en"
punctuate (boolean, default: false): Add punctuation to the transcript
format_text (boolean, default: false): Format text in the transcript
language_detection (boolean, default: false): Enable automatic language detection
language_confidence_threshold (number, default: 0): Confidence threshold for language detection
audio_start_from (integer, default: 0): Start transcription from this timestamp (milliseconds)
audio_end_at (integer, default: 0): End transcription at this timestamp (milliseconds)
multichannel (boolean, default: false): Enable multi-channel audio processing
speech_models (string[]): Array of speech models to use
speech_threshold (number, default: 0): Speech detection threshold (0.0-1.0)
disfluencies (boolean, default: false): Include disfluencies in the transcript
speaker_labels (boolean, default: false): Enable speaker diarization
speakers_expected (integer, default: 0): Expected number of speakers
sentiment_analysis (boolean, default: false): Enable sentiment analysis
entity_detection (boolean, default: false): Enable entity detection
auto_highlights (boolean, default: false): Enable automatic highlights extraction
content_safety (boolean, default: false): Enable content safety detection
iab_categories (boolean, default: false): Enable IAB category classification
auto_chapters (boolean, default: false): Enable automatic chapter generation
summarization (boolean, default: false): Enable automatic summarization
summary_model (string): Model to use for summarization
summary_type (string): Type of summary to generate
custom_topics (boolean, default: false): Enable custom topic detection
topics (string[]): Array of custom topics to detect
redact_pii (boolean, default: false): Redact personally identifiable information
redact_pii_sub (string): PII redaction substitution method
redact_pii_policies (string[]): Array of PII policies to apply
redact_pii_audio (boolean, default: false): Redact PII from audio
redact_pii_audio_quality (string): Quality setting for PII audio redaction
filter_profanity (boolean, default: false): Filter profanity from transcript
custom_spelling (object[]): Array of custom spelling corrections
speech_understanding (object): Speech understanding configuration

Response

200 - application/json

Audio stream started successfully

code (string): Response code. Example: 200
message (string): Response message. Example: "success"
data (object): Task information (task_id, message)