Product: Visual Intelligence — Audio File Transcription
Use case: Transcribe an uploaded audio/video file to text — async batch or sync, multiple providers (Whisper, ElevenLabs, AssemblyAI) with optional speaker labels. For live streams, see Live Audio Transcription.
Host: https://mavi-backend.memories.ai/serve/api/v2
Auth: Authorization: sk-mavi-... (no Bearer prefix)
This endpoint uses the ElevenLabs Scribe V2 model and returns results synchronously.
Pricing: $0.39/hour of audio, billed by actual audio duration (in seconds).
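As a rough illustration of the pricing above, the sketch below estimates the charge for a clip from its duration in seconds. The function name `estimate_cost_usd` is hypothetical, and how the service rounds fractional amounts is not stated here, so the result is left unrounded.

```python
PRICE_PER_HOUR_USD = 0.39  # rate quoted in the pricing line above


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate the charge for a clip billed by its duration in seconds.

    Assumption: linear per-second billing with no minimum charge or rounding.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return duration_seconds / 3600 * PRICE_PER_HOUR_USD
```

Under these assumptions, a 90-minute recording (5400 seconds) comes to roughly $0.585.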
Audio Source
You must provide one of asset_id, url, or source_url. If more than one is present, the first in the order asset_id > url > source_url is used.
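The priority rule can be mirrored on the client so that only one source field is ever sent. This is a minimal sketch; the helper name `pick_audio_source` is hypothetical and not part of the API.

```python
from typing import Optional


def pick_audio_source(asset_id: Optional[str] = None,
                      url: Optional[str] = None,
                      source_url: Optional[str] = None) -> dict:
    """Return a one-key dict holding the highest-priority source provided.

    Follows the documented order: asset_id > url > source_url.
    """
    candidates = (("asset_id", asset_id), ("url", url), ("source_url", source_url))
    for name, value in candidates:
        if value:
            return {name: value}
    raise ValueError("provide one of asset_id, url, or source_url")
```

The returned dict can be merged into the request body, which avoids relying on server-side resolution when several sources happen to be available.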
Parameters
Authorization
string
API key for authentication (e.g. sk-mavi-...), sent without a Bearer prefix.
provider
string
default:"elevenlabs"
STT provider. Use elevenlabs for this endpoint.
asset_id
string
The unique identifier of an uploaded audio/video asset (e.g. re_xxx). Resolved to a signed GCS URL.
url
string
A publicly accessible audio URL.
source_url
string
A gs:// GCS path or public HTTP URL. GCS paths are converted to signed URLs automatically.
language_code
string
Language code (ISO 639-1, e.g. en, zh). If omitted, the provider auto-detects the language.
model_id
string
default:"scribe_v2"
Model to use.
diarize
boolean
Enable speaker diarization.
timestamps_granularity
string
Timestamp level: none, segment, or word.
Tag audio events such as music, laughter, applause.
num_speakers
integer
Expected number of speakers (improves diarization).
Audio format hint (e.g. pcm_s16le_16000).
Source language for translation.
Target language for translation.
Code Examples
curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text \
  --header 'Authorization: sk-mavi-...' \
  --header 'Content-Type: application/json' \
  --data '{
    "provider": "elevenlabs",
    "asset_id": "re_657929111888723968",
    "model_id": "scribe_v2",
    "language_code": "en",
    "diarize": true,
    "timestamps_granularity": "word",
    "num_speakers": 2
  }'
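For callers who prefer Python, the same request can be sketched with the standard library alone. The helper names `build_request` and `transcribe` and the 300-second timeout are assumptions made for illustration. The URL, headers, and body fields are taken from the curl example.

```python
import json
import urllib.request

API_URL = "https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST request; the key is sent as-is, without a Bearer prefix."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def transcribe(api_key: str, payload: dict, timeout: float = 300.0) -> dict:
    """Send the request and return the decoded JSON body.

    The timeout is an assumed value; the call is synchronous, so long
    recordings may need more time.
    """
    request = build_request(api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `transcribe(api_key, {"provider": "elevenlabs", "asset_id": "re_657929111888723968", "diarize": True})` would then return the parsed response. Note that `urllib` raises `urllib.error.HTTPError` for non-2xx status codes.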
Response
{
  "code": 200,
  "msg": "success",
  "data": {
    "language_code": "en",
    "language_probability": 0.98,
    "text": "Hello, how are you today? I'm doing well, thank you.",
    "words": [
      {
        "text": "Hello,",
        "start": 0.0,
        "end": 0.52,
        "type": "word",
        "speaker_id": "speaker_0"
      },
      {
        "text": " ",
        "start": 0.52,
        "end": 0.52,
        "type": "spacing"
      },
      {
        "text": "how",
        "start": 0.52,
        "end": 0.78,
        "type": "word",
        "speaker_id": "speaker_0"
      }
    ]
  },
  "failed": false,
  "success": true
}
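Before reading the transcript it may be worth checking the response envelope. The sketch below assumes that failed calls reuse the same `code`, `msg`, `success`, and `failed` fields shown above. The error shape is not documented here, so treat that as an assumption. `TranscriptionError` and `unwrap_transcription` are hypothetical names.

```python
class TranscriptionError(Exception):
    """Raised when the response envelope does not report success."""


def unwrap_transcription(response: dict) -> dict:
    """Return the `data` object, or raise if the envelope signals failure.

    Assumption: failures set `success` to false or `failed` to true.
    """
    if response.get("success") is not True or response.get("failed") is True:
        raise TranscriptionError(
            f"code={response.get('code')} msg={response.get('msg')}"
        )
    return response["data"]
```

With the example response, `unwrap_transcription(response)["text"]` yields the full transcription string.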
Response Parameters
| Parameter | Type | Description |
|---|---|---|
| data.language_code | string | Detected language code (ISO 639-1) |
| data.language_probability | number | Confidence of language detection (0.0–1.0) |
| data.text | string | Full transcription text |
| data.words | array[object] | Word-level transcription with timing |
| data.words[].text | string | The word or spacing text |
| data.words[].start | number | Start time in seconds |
| data.words[].end | number | End time in seconds |
| data.words[].type | string | Token type: word, spacing, or audio_event |
| data.words[].speaker_id | string | Speaker identifier (e.g. speaker_0). Only present when diarize=true. |
Timestamps are in seconds (e.g. 0.52).
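When diarize is enabled, the word tokens can be folded into speaker turns. This is a minimal sketch built only on the fields in the table above; the function name `speaker_turns` is hypothetical. It assumes spacing tokens carry no speaker_id (as in the example response) and skips audio_event tokens.

```python
def speaker_turns(words: list) -> list:
    """Group word tokens into consecutive turns per speaker.

    Each turn holds speaker_id, start and end (in seconds), and the joined text.
    """
    turns = []
    for token in words:
        kind = token.get("type")
        if kind == "audio_event":
            continue  # ignore tagged events such as laughter or applause
        if kind == "spacing":
            if turns:
                turns[-1]["text"] += token["text"]  # keep spacing inside the turn
            continue
        speaker = token.get("speaker_id")
        if not turns or turns[-1]["speaker_id"] != speaker:
            # A new speaker starts a new turn.
            turns.append({
                "speaker_id": speaker,
                "start": token["start"],
                "end": token["end"],
                "text": token["text"],
            })
        else:
            turns[-1]["text"] += token["text"]
            turns[-1]["end"] = token["end"]
    for turn in turns:
        turn["text"] = turn["text"].strip()  # drop spacing left at turn edges
    return turns
```

Each returned turn can be printed as a line of the form `speaker_0 [0.00-0.78]: Hello, how`, which is often easier to read than the raw token list.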