Speaker Diarization

Product: Visual Intelligence (Audio File Transcription)
Use case: Transcribe an uploaded audio/video file to text, async batch or sync, with multiple providers (Whisper, ElevenLabs, AssemblyAI) and optional speaker labels. For live streams, see Live Audio Transcription.
Host: https://mavi-backend.memories.ai/serve/api/v2
Auth: Authorization: sk-mavi-... (no Bearer prefix)

Segment audio or video by speaker using pyannote. Returns timestamped speaker turns labeled SPEAKER_00, SPEAKER_01, etc. — anonymous labels based on voice characteristics, not identity. Need named speakers? Use Multimodal Speaker Recognition, which combines voice + face recognition to identify speakers by name.

Pricing: $0.001/second of audio or video
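As a rough illustration of the per-second rate, the sketch below estimates the cost of a file from its duration. It assumes billing is strictly linear in duration, with no rounding or minimum charge; the page does not say either way. The function name `estimate_cost_usd` is hypothetical.

```python
# Rate taken from the pricing line above.
PRICE_PER_SECOND_USD = 0.001


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate diarization cost, assuming linear per-second billing.

    Assumption: no minimum charge and no rounding of the duration.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Round to suppress floating-point noise in the result.
    return round(duration_seconds * PRICE_PER_SECOND_USD, 6)


print(estimate_cost_usd(13.40))  # 0.0134
print(estimate_cost_usd(3600))   # 3.6 (a one-hour file)
```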

Endpoints

Method	Endpoint	Returns
`POST`	`/transcriptions/sync-generate-speaker`	Result directly
`POST`	`/transcriptions/async-generate-speaker`	`task_id` + webhook callback

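Use the sync endpoint for short clips, where waiting for the result in the HTTP response is practical. Use the async endpoint for long files; it returns a `task_id` immediately and delivers the result through a webhook callback.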

The async endpoint requires a configured webhook URL. See Webhooks Settings and the Webhooks Guide.

Request Body

Parameter	Type	Required	Description
asset_id	string	Yes	The uploaded audio or video asset ID

Code Examples

curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/sync-generate-speaker \
  --header 'Authorization: sk-mavi-...' \
  --header 'Content-Type: application/json' \
  --data '{ "asset_id": "re_657929111888723968" }'
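The same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_diarization_request` and `send`, and the 60-second timeout, are assumptions made for illustration. It also assumes the async endpoint accepts the same `asset_id` body as the sync one, since the Request Body table lists no other parameter.

```python
import json
import urllib.request

BASE_URL = "https://mavi-backend.memories.ai/serve/api/v2"


def build_diarization_request(
    asset_id: str, api_key: str, sync: bool = True
) -> urllib.request.Request:
    """Build (but do not send) a POST request for speaker diarization."""
    path = (
        "/transcriptions/sync-generate-speaker"
        if sync
        else "/transcriptions/async-generate-speaker"
    )
    body = json.dumps({"asset_id": asset_id}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method="POST",
        headers={
            # The raw key goes in the header, with no "Bearer " prefix.
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body (timeout is an assumed value)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `send(build_diarization_request("re_657929111888723968", "sk-mavi-..."))`, with the real key substituted for the elided one.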

Sync Response

{
  "code": 200,
  "msg": "success",
  "data": {
    "model": "pyannote",
    "items": [
      { "start": 0.03, "end": 0.13, "speaker": "SPEAKER_01" },
      { "start": 0.13, "end": 13.40, "speaker": "SPEAKER_00" }
    ]
  },
  "failed": false,
  "success": true
}

Sync Response Parameters

Parameter	Type	Description
data.model	string	Model used (e.g., `pyannote`)
data.items	array	Speaker segments
data.items[].start	number	Segment start time in seconds
data.items[].end	number	Segment end time in seconds
data.items[].speaker	string	Anonymous speaker label (e.g., `SPEAKER_00`)
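One common use of the segments is totalling talk time per speaker. The sketch below does that for a sync response shaped like the sample above. The function name `talk_time_by_speaker` is hypothetical, and treating `success` being true together with `code` 200 as the success condition is an assumption based on the sample, not a documented contract.

```python
from collections import defaultdict


def talk_time_by_speaker(response: dict) -> dict:
    """Sum segment durations (in seconds) per anonymous speaker label."""
    # Assumption: a successful call has success == True and code == 200.
    if not response.get("success") or response.get("code") != 200:
        raise ValueError(f"diarization failed: {response.get('msg')!r}")

    totals = defaultdict(float)
    for item in response["data"]["items"]:
        totals[item["speaker"]] += item["end"] - item["start"]
    return dict(totals)


# The sample sync response from this page.
sample = {
    "code": 200,
    "msg": "success",
    "data": {
        "model": "pyannote",
        "items": [
            {"start": 0.03, "end": 0.13, "speaker": "SPEAKER_01"},
            {"start": 0.13, "end": 13.40, "speaker": "SPEAKER_00"},
        ],
    },
    "failed": False,
    "success": True,
}

for speaker, seconds in sorted(talk_time_by_speaker(sample).items()):
    print(f"{speaker}: {seconds:.2f}s")
```

For the sample, `SPEAKER_00` accounts for about 13.27 seconds and `SPEAKER_01` for about 0.10 seconds.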

Async Response

{
  "code": 200,
  "msg": "success",
  "data": { "task_id": "ec2449885ba84c4f943a80ff0633158e" },
  "failed": false,
  "success": true
}

Callback Response Parameters

Parameter	Type	Description
data.data.data	array	Speaker segments
data.data.data[].start	number	Segment start time in seconds
data.data.data[].end	number	Segment end time in seconds
data.data.data[].speaker	string	Anonymous speaker label
data.data.usage_metadata.duration	number	Total audio duration in seconds
data.data.usage_metadata.model	string	Model used
task_id	string	Task ID matching the initial response
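The nested `data.data.data` path is easy to get wrong, so here is a minimal sketch of unpacking a callback body in a webhook handler. The payload below is a hypothetical example reconstructed from the parameter table, not a captured callback; in particular, placing `task_id` at the top level and the values of `duration` and `model` are assumptions. The function name `parse_diarization_callback` is also hypothetical.

```python
def parse_diarization_callback(payload: dict) -> dict:
    """Flatten a callback payload into the fields listed in the table above."""
    inner = payload["data"]["data"]  # holds both the segments and usage_metadata
    usage = inner["usage_metadata"]
    return {
        "task_id": payload["task_id"],  # assumed to sit at the top level
        "segments": inner["data"],      # list of {start, end, speaker}
        "duration": usage["duration"],
        "model": usage["model"],
    }


# Hypothetical payload built from the documented field paths.
example_payload = {
    "task_id": "ec2449885ba84c4f943a80ff0633158e",
    "data": {
        "data": {
            "data": [
                {"start": 0.03, "end": 0.13, "speaker": "SPEAKER_01"},
                {"start": 0.13, "end": 13.40, "speaker": "SPEAKER_00"},
            ],
            "usage_metadata": {"duration": 13.40, "model": "pyannote"},
        }
    },
}

result = parse_diarization_callback(example_payload)
print(result["task_id"], len(result["segments"]), result["duration"])
```

Matching `result["task_id"]` against the `task_id` returned by the async endpoint ties the callback back to the original request.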
