Product: Visual Intelligence — Audio File Transcription
Use case: Transcribe an uploaded audio/video file to text — async batch or sync, multiple providers (Whisper, ElevenLabs, AssemblyAI) with optional speaker labels. For live streams, see Live Audio Transcription.
Host:
https://mavi-backend.memories.ai/serve/api/v2
Auth: Authorization: sk-mavi-... (no Bearer prefix)
Set speaker: true to label each segment by speaker (doubles the price).
Use this endpoint when you need fast, cost-effective speech-to-text on your own uploaded assets. For third-party providers with richer features (word-level confidence, entity detection, PII redaction), see ElevenLabs or AssemblyAI.
Pricing:
- $0.001/second (without speaker labeling)
- $0.002/second (with speaker: true)
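A rough cost estimate follows directly from the per-second rates above. This is a minimal sketch assuming cost is linear in audio duration; how partial seconds are billed is not stated on this page, so `round_up_seconds` is a hypothetical budgeting option, not a documented billing rule.

```python
import math

# Per-second rates from the pricing list above (USD).
RATE_PLAIN = 0.001    # speaker labeling off
RATE_SPEAKER = 0.002  # speaker: true


def estimate_cost(duration_seconds: float, speaker: bool = False,
                  round_up_seconds: bool = False) -> float:
    """Estimate the transcription cost in USD for one file.

    Assumption: cost = duration * rate. round_up_seconds is a hypothetical
    knob for conservative budgeting; partial-second billing is undocumented.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    seconds = math.ceil(duration_seconds) if round_up_seconds else duration_seconds
    rate = RATE_SPEAKER if speaker else RATE_PLAIN
    return seconds * rate


if __name__ == "__main__":
    # A 10-minute file, without and with speaker labels.
    print(f"{estimate_cost(600):.2f}")
    print(f"{estimate_cost(600, speaker=True):.2f}")
```

Doubling the rate for speaker: true is the only branch; everything else scales with duration.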
Endpoints
| Method | Endpoint | Returns |
|---|---|---|
| POST | /transcriptions/sync-generate-audio | Result directly |
| POST | /transcriptions/async-generate-audio | task_id + webhook callback |
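The host, the auth header, and the endpoint table combine into a single POST request. Below is a minimal standard-library sketch; `build_request` is a hypothetical helper, and sending the body as JSON with `Content-Type: application/json` is an assumption not stated on this page.

```python
import json
import urllib.request

# Host and endpoint paths from this page.
HOST = "https://mavi-backend.memories.ai/serve/api/v2"
ENDPOINTS = {
    "sync": "/transcriptions/sync-generate-audio",    # result in the response
    "async": "/transcriptions/async-generate-audio",  # task_id now, webhook later
}


def build_request(api_key: str, payload: dict, mode: str = "sync") -> urllib.request.Request:
    """Build a POST request for the sync or async transcription endpoint.

    The key goes into the Authorization header as-is, with no "Bearer " prefix.
    The JSON content type is an assumption.
    """
    if mode not in ENDPOINTS:
        raise ValueError(f"mode must be one of {sorted(ENDPOINTS)}")
    return urllib.request.Request(
        url=HOST + ENDPOINTS[mode],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Pass the result to `urllib.request.urlopen` to send it, using your own key in place of the placeholder.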
Error Responses
Verified live against the sync and async endpoints: the string "Request has exceeded the limit." is shared across multiple failure paths on this endpoint, covering both true rate-limit rejections AND validation failures (unknown asset_id, wrong model, etc.). Branch on HTTP 400 only; don’t try to parse msg to tell these cases apart.
Supported Models
whisper-1
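Per the Error Responses note above, a client can only treat HTTP 400 as "request rejected" and cannot tell a rate limit from a bad asset_id by reading msg. The following sketch applies that rule with the standard library. `RequestRejected` and the injectable `opener` argument (useful for testing) are hypothetical names, and retry policy is left to the caller.

```python
import json
import urllib.error
import urllib.request


class RequestRejected(Exception):
    """HTTP 400: a rate limit or a validation failure (the API does not say which)."""


def post_json(req: urllib.request.Request, opener=urllib.request.urlopen,
              timeout: float = 60.0) -> dict:
    """Send req and return the decoded JSON response body.

    HTTP 400 becomes RequestRejected without reading msg, because the same
    message text is used for both causes. Other HTTP errors propagate as-is.
    """
    try:
        with opener(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 400:
            # Re-check asset_id and model; if they are valid, retry later with backoff.
            raise RequestRejected("transcription request rejected (HTTP 400)") from err
        raise
```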
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| asset_id | string | Yes | The uploaded audio or video asset ID to transcribe |
| model | string | No | Transcription model (default: whisper-1) |
| speaker | boolean | No | Enable speaker labeling. Each segment will include a speaker field (e.g., SPEAKER_00). Doubles the price (default: false). |
Code Examples
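Here is a minimal sketch of assembling the request body from the table above, assuming whisper-1 is the only accepted model (as listed under Supported Models). Validating locally is a design choice: it catches mistakes that the server would otherwise report only as an ambiguous HTTP 400. `build_body` is a hypothetical helper, and omitting speaker when false relies on its documented default.

```python
SUPPORTED_MODELS = {"whisper-1"}  # per Supported Models on this page
DEFAULT_MODEL = "whisper-1"


def build_body(asset_id: str, model: str = DEFAULT_MODEL, speaker: bool = False) -> dict:
    """Assemble the JSON request body shared by the sync and async endpoints."""
    if not isinstance(asset_id, str) or not asset_id:
        raise ValueError("asset_id is required")
    if model not in SUPPORTED_MODELS:
        # Catch this locally; the server reports it only as an ambiguous HTTP 400.
        raise ValueError(f"unsupported model: {model!r}")
    body = {"asset_id": asset_id, "model": model}
    if speaker:
        body["speaker"] = True  # labels each segment; doubles the per-second price
    return body
```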
Sync Response
Sync Response Parameters
| Parameter | Type | Description |
|---|---|---|
| data.model | string | Model used (e.g., whisper-1) |
| data.items | array | Transcription segments |
| data.items[].text | string | Transcribed text for this segment |
| data.items[].start_time | number | Segment start time in seconds |
| data.items[].end_time | number | Segment end time in seconds |
| data.items[].speaker | string | Speaker label (e.g., SPEAKER_00). Only present when speaker=true. |
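The sketch below maps the sync response fields into typed segments and renders a readable transcript. It assumes the segments sit at data.items, as the parameter names indicate. The `Segment` class and helper names are illustrative, not part of the API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    start_time: float              # seconds
    end_time: float                # seconds
    text: str
    speaker: Optional[str] = None  # present only when speaker=true


def parse_sync_response(resp: dict) -> List[Segment]:
    """Turn a decoded sync response into Segment objects."""
    return [
        Segment(
            start_time=item["start_time"],
            end_time=item["end_time"],
            text=item["text"],
            speaker=item.get("speaker"),
        )
        for item in resp["data"].get("items", [])
    ]


def format_transcript(segments: List[Segment]) -> str:
    """One line per segment, e.g. "[0.0-1.5] SPEAKER_00: Hello there."."""
    lines = []
    for seg in segments:
        label = f"{seg.speaker}: " if seg.speaker else ""
        lines.append(f"[{seg.start_time:.1f}-{seg.end_time:.1f}] {label}{seg.text}")
    return "\n".join(lines)
```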
Async Response
The initial response returns a task_id. Results are delivered to your webhook URL when transcription completes.
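This page does not describe how the webhook URL is registered, so the sketch below covers only the receiving side: a small standard-library HTTP server that accepts the POSTed callback JSON and matches its top-level task_id against tasks you submitted. The `pending` set, the port, and the handler names are hypothetical. Which field of the initial async response holds the task_id is also not documented here, so adding it to `pending` is left to you.

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical bookkeeping: add each task_id from an async submission here.
pending = set()
pending_lock = threading.Lock()
# Payloads whose task_id matched a pending task, ready for a worker thread.
results = queue.Queue()


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        try:
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
        except ValueError:  # bad Content-Length, malformed JSON, or bad encoding
            self.send_response(400)
            self.end_headers()
            return
        task_id = payload.get("task_id") if isinstance(payload, dict) else None
        with pending_lock:
            known = task_id in pending
            pending.discard(task_id)
        if known:
            results.put(payload)
        # Acknowledge receipt; callbacks for unknown task_ids are ignored.
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging in this example


def start_receiver(host: str = "127.0.0.1", port: int = 8080) -> ThreadingHTTPServer:
    """Serve callbacks on a background thread; call .shutdown() to stop."""
    server = ThreadingHTTPServer((host, port), CallbackHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Matching on task_id before queuing means a stray or duplicate callback cannot be mistaken for a result you are waiting on.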
Callback Response Parameters
| Parameter | Type | Description |
|---|---|---|
| data.data.data | array | Transcription segments |
| data.data.data[].start_time | number | Segment start time in seconds |
| data.data.data[].end_time | number | Segment end time in seconds |
| data.data.data[].text | string | Transcribed text |
| data.data.data[].speaker | string or null | Speaker label, or null if speaker=false |
| data.data.usage_metadata.duration | number | Audio duration in seconds |
| data.data.usage_metadata.model | string | Model used |
| task_id | string | Task ID matching the initial response |
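Note the extra nesting: callback segments live at data.data.data, not at data.items as in the sync response. The sketch below flattens a callback payload using only the fields in the table above. `CallbackResult` and `parse_callback` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CallbackResult:
    task_id: str
    duration: Optional[float] = None  # data.data.usage_metadata.duration (seconds)
    model: Optional[str] = None       # data.data.usage_metadata.model
    segments: List[dict] = field(default_factory=list)

    @property
    def text(self) -> str:
        """All segment texts joined with single spaces."""
        return " ".join(seg["text"].strip() for seg in self.segments)


def parse_callback(payload: dict) -> CallbackResult:
    """Flatten a webhook payload using the documented callback fields."""
    inner = payload["data"]["data"]            # data.data
    usage = inner.get("usage_metadata") or {}  # data.data.usage_metadata
    segments = [
        {
            "start_time": seg["start_time"],
            "end_time": seg["end_time"],
            "text": seg["text"],
            "speaker": seg.get("speaker"),  # null when speaker=false
        }
        for seg in inner.get("data") or []     # data.data.data
    ]
    return CallbackResult(
        task_id=payload["task_id"],
        duration=usage.get("duration"),
        model=usage.get("model"),
        segments=segments,
    )
```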
