Whisper Audio Transcription

Product: Visual Intelligence (Audio File Transcription)
Use case: Transcribe an uploaded audio/video file to text, either async batch or sync, with multiple providers (Whisper, ElevenLabs, AssemblyAI) and optional speaker labels. For live streams, see Live Audio Transcription.
Host: https://mavi-backend.memories.ai/serve/api/v2
Auth: Authorization: sk-mavi-... (no Bearer prefix)

Transcribe speech from audio or video files using OpenAI Whisper. Returns timestamped text segments. Add speaker: true to label each segment by speaker (doubles the price). Use this endpoint when you need fast, cost-effective speech-to-text on your own uploaded assets. For third-party providers with richer features (word-level confidence, entity detection, PII redaction), see ElevenLabs or AssemblyAI.

Pricing:

$0.001/second (without speaker labeling)
$0.002/second (with speaker: true)
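
As a quick sanity check on the rates above, the arithmetic can be sketched as follows. The function name `estimate_cost` is hypothetical, and the sketch assumes billing is strictly linear in audio duration with no rounding or minimum charge, which the pricing list does not state.

```python
from decimal import Decimal

# Rates from the pricing list above, in USD per second of audio.
RATE_PLAIN = Decimal("0.001")
RATE_SPEAKER = Decimal("0.002")


def estimate_cost(duration_seconds: float, speaker: bool = False) -> Decimal:
    """Estimate the cost in USD, assuming linear per-second billing (an assumption)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    rate = RATE_SPEAKER if speaker else RATE_PLAIN
    # Convert via str() so float inputs do not carry binary rounding noise.
    return Decimal(str(duration_seconds)) * rate
```

Under these assumptions, a 10-minute file (600 seconds) comes to $0.60 without speaker labels and $1.20 with them.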

Endpoints

Method	Endpoint	Returns
`POST`	`/transcriptions/sync-generate-audio`	Result directly
`POST`	`/transcriptions/async-generate-audio`	`task_id` + webhook callback

Use sync for short clips where you want an immediate result. Use async for long files — you’ll receive the result via webhook when processing completes.

The async endpoint requires a configured webhook URL. See Webhooks Settings and the Webhooks Guide. Without a configured webhook, the async endpoint rejects requests with:

HTTP 400 {"code": 400, "msg": "An async request requires at least one webhook.", "data": null}

Error Responses

Verified live against the sync and async endpoints:

// Sync — missing asset_id
HTTP 400 {"code": 400, "msg": "asset_id cannot be null or empty", "data": null}

// Sync / async — unknown asset_id (and many other validation failures)
HTTP 400 {"code": 400, "msg": "Request has exceeded the limit.", "data": null}

The string "Request has exceeded the limit." is shared across multiple failure paths on this endpoint: true rate-limit rejections and validation failures (unknown asset_id, wrong model, etc.). Branch on HTTP 400 only; do not try to parse msg to tell the causes apart.
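
One way to follow that advice is to treat every HTTP 400 as a single failure class and keep msg only for logging. The sketch below is illustrative: `TranscriptionError` and `check_response` are hypothetical names, not part of any SDK, and status codes other than 400 are not documented in this section, so they are left to the caller.

```python
import json


class TranscriptionError(Exception):
    """Raised for any HTTP 400 from the transcription endpoints."""

    def __init__(self, status: int, msg: str):
        super().__init__(f"HTTP {status}: {msg}")
        self.status = status
        self.msg = msg  # kept for logging only; never used for branching


def check_response(status: int, raw_body: str) -> dict:
    """Return the parsed JSON body, or raise TranscriptionError on HTTP 400."""
    body = json.loads(raw_body)
    if status == 400:
        # Branch on the status code alone: the msg text is shared between
        # rate-limit rejections and validation failures.
        raise TranscriptionError(status, body.get("msg", ""))
    return body
```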

Supported Models

whisper-1

Request Body

Parameter	Type	Required	Description
asset_id	string	Yes	The uploaded audio or video asset ID to transcribe
model	string	No	Transcription model (default: `whisper-1`)
speaker	boolean	No	Enable speaker labeling. Each segment will include `speaker` (e.g., `SPEAKER_00`). Doubles the price (default: `false`).

Code Examples

curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/sync-generate-audio \
  --header 'Authorization: sk-mavi-...' \
  --header 'Content-Type: application/json' \
  --data '{
    "asset_id": "re_657929111888723968",
    "model": "whisper-1",
    "speaker": false
  }'
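
For callers working in Python, the same request can be sketched with the standard library alone. `build_sync_request` is a hypothetical helper name; the URL, headers, and body fields mirror the curl example above.

```python
import json
import urllib.request

BASE_URL = "https://mavi-backend.memories.ai/serve/api/v2"


def build_sync_request(api_key: str, asset_id: str, speaker: bool = False,
                       model: str = "whisper-1") -> urllib.request.Request:
    """Build (but do not send) a POST request for the sync endpoint."""
    payload = {"asset_id": asset_id, "model": model, "speaker": speaker}
    return urllib.request.Request(
        url=f"{BASE_URL}/transcriptions/sync-generate-audio",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": api_key,  # raw key, no "Bearer " prefix
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending the request performs a network call, so it is shown commented out:
# req = build_sync_request("sk-mavi-...", "re_657929111888723968")
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx responses, so the HTTP 400 cases described under Error Responses surface as exceptions rather than return values.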

Sync Response

{
  "code": 200,
  "msg": "success",
  "data": {
    "model": "whisper-1",
    "items": [
      { "text": "Hello, how are you today?", "start_time": 0.0, "end_time": 2.98 },
      { "text": "I'm doing well, thank you.", "start_time": 2.98, "end_time": 6.78 }
    ]
  },
  "failed": false,
  "success": true
}

Sync Response Parameters

Parameter	Type	Description
data.model	string	Model used (e.g., `whisper-1`)
data.items	array	Transcription segments
data.items[].text	string	Transcribed text for this segment
data.items[].start_time	number	Segment start time in seconds
data.items[].end_time	number	Segment end time in seconds
data.items[].speaker	string	Speaker label (e.g., `SPEAKER_00`). Only present when `speaker=true`.
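
To show how these fields fit together, here is a minimal sketch that turns a sync response into readable transcript lines. `format_segments` is a hypothetical helper; it reads speaker with `.get()` because that field is only present when speaker labeling was requested.

```python
def format_segments(response: dict) -> list[str]:
    """Render each segment as '[start-end] SPEAKER: text' (speaker is optional)."""
    lines = []
    for item in response["data"]["items"]:
        speaker = item.get("speaker")  # absent unless speaker=true was sent
        prefix = f"{speaker}: " if speaker else ""
        lines.append(
            f"[{item['start_time']:.2f}-{item['end_time']:.2f}] {prefix}{item['text']}"
        )
    return lines
```

Applied to the sample response above, the first line would read `[0.00-2.98] Hello, how are you today?`.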

Async Response

The initial response returns a task_id. Results are delivered to your webhook URL when transcription completes.

{
  "code": 200,
  "msg": "success",
  "data": { "task_id": "ec2449885ba84c4f943a80ff0633158e" },
  "failed": false,
  "success": true
}

Callback Response Parameters

Parameter	Type	Description
data.data.data	array	Transcription segments
data.data.data[].start_time	number	Segment start time in seconds
data.data.data[].end_time	number	Segment end time in seconds
data.data.data[].text	string	Transcribed text
data.data.data[].speaker	string \| null	Speaker label, or `null` if `speaker=false`
data.data.usage_metadata.duration	number	Audio duration in seconds
data.data.usage_metadata.model	string	Model used
task_id	string	Task ID matching the initial response
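
A webhook handler might unpack the callback along these lines. This is a sketch under an assumption: the nested envelope is inferred from the dotted paths in the table above (task_id at the top level, everything else under data.data), since no sample callback payload is shown here. `parse_callback` is a hypothetical name.

```python
def parse_callback(payload: dict) -> dict:
    """Flatten a callback payload into the fields most handlers need.

    Assumes the envelope implied by the parameter table:
    payload["task_id"] and payload["data"]["data"]{"data", "usage_metadata"}.
    """
    inner = payload["data"]["data"]
    usage = inner["usage_metadata"]
    return {
        "task_id": payload["task_id"],  # match against the initial async response
        "duration": usage["duration"],  # audio duration in seconds
        "model": usage["model"],
        "segments": inner["data"],      # each has start_time, end_time, text, speaker
    }
```

Unlike the sync response, where speaker is omitted when labeling is off, callback segments carry speaker as null in that case, so handlers should check for None rather than a missing key.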
