POST /transcriptions/async-generate-multi-speaker
Async Generate Multi Speaker


Product: Visual Intelligence - Audio File Transcription
Use case: Transcribe an uploaded audio/video file to text, in async batch or sync mode, with multiple providers (Whisper, ElevenLabs, AssemblyAI) and optional speaker labels. For live streams, see Live Audio Transcription.
Host: https://mavi-backend.memories.ai/serve/api/v2
Auth: Authorization: sk-mavi-... (no Bearer prefix)

Identifies who said what by combining audio speaker diarization (pyannote) with face recognition (Gemini). Returns transcription segments labeled with real names, alongside face images extracted from the video.

How this differs from Speaker Diarization: Speaker Diarization assigns only anonymous labels (SPEAKER_00, SPEAKER_01). Multimodal Speaker Recognition matches voices to faces to produce named speaker identification, but costs roughly 100× more.

This is an async endpoint: the immediate response contains only a task_id, and the results are delivered to your webhook. Configure a webhook URL in Webhooks Settings before calling this endpoint; otherwise you will not receive the processing results. See Webhooks Configuration Guide for details.
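
Because results arrive only through the webhook, a receiver has to be running before the first request is sent. The following is a minimal sketch of such a receiver using only the Python standard library. It assumes the callback is delivered as an HTTP POST with a JSON body, which this page does not state explicitly; `RESULTS`, `record_callback`, `CallbackHandler`, and `make_server` are illustrative names, not part of the API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory store of received callbacks, keyed by task_id (illustrative only).
RESULTS: dict = {}


def record_callback(body: bytes) -> str:
    """Parse a callback body, store it under its task_id, and return the task_id."""
    payload = json.loads(body.decode("utf-8"))
    task_id = payload["task_id"]  # top-level field per the callback parameters
    RESULTS[task_id] = payload
    return task_id


class CallbackHandler(BaseHTTPRequestHandler):
    # Assumption: the callback arrives as an HTTP POST with a JSON body.
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            record_callback(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # malformed or unexpected payload
        else:
            self.send_response(200)
        self.end_headers()


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) a server that listens for callbacks."""
    return HTTPServer(("", port), CallbackHandler)
```

Calling `make_server(8000).serve_forever()` starts the listener. A production receiver would also need a public HTTPS URL and some way to verify that requests are genuine.
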
Pricing:
  • $0.10 per minute of video or audio
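
As a rough illustration of the pricing above, the sketch below estimates the cost of processing a file of a given length. It assumes usage is pro-rated by the second; this page does not say whether partial minutes are rounded up, so treat the result as an estimate. `estimate_cost` is a hypothetical helper, not part of the API.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.10")  # USD per minute, from the pricing note


def estimate_cost(duration_seconds: float) -> Decimal:
    """Estimate the cost in USD, assuming pro-rated (per-second) billing."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return (minutes * PRICE_PER_MINUTE).quantize(Decimal("0.0001"))
```

Under that assumption, a 90-second clip comes to $0.15 and a one-hour recording to $6.00.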

Request Body

Parameter | Type | Required | Description
asset_id | string | Yes | The unique identifier of the video or audio asset for multi-speaker identification

Code Example

curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/async-generate-multi-speaker \
  --header 'Authorization: sk-mavi-...' \
  --header 'Content-Type: application/json' \
  --data '{
    "asset_id": "re_657929111888723968"
  }'
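
The same request can be sketched in Python with the standard library. The URL, headers, and body mirror the curl command above; `build_request` and `submit` are illustrative names, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "https://mavi-backend.memories.ai/serve/api/v2"


def build_request(api_key: str, asset_id: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = json.dumps({"asset_id": asset_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/transcriptions/async-generate-multi-speaker",
        data=payload,
        method="POST",
        headers={
            "Authorization": api_key,  # raw key, no "Bearer " prefix
            "Content-Type": "application/json",
        },
    )


def submit(api_key: str, asset_id: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(api_key, asset_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating `build_request` from `submit` lets the request be inspected or logged before any network call is made.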

Response

Returns the multi-speaker identification task information.
{
  "code": 200,
  "msg": "success",
  "data": {
    "task_id": "ec2449885ba84c4f943a80ff0633158e"
  },
  "failed": false,
  "success": true
}

Response Parameters

Parameter | Type | Description
code | string | Response code indicating the result status (the example response shows it as the number 200)
msg | string | Response message describing the operation result
data | object | Response data object containing task information
data.task_id | string | Unique identifier of the multi-speaker identification task
success | boolean | Indicates whether the operation was successful
failed | boolean | Indicates whether the operation failed
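
A caller usually needs only `data.task_id` from this response, in order to match it against the later callback. The sketch below checks the status fields before reading it. `extract_task_id` is a hypothetical helper; it compares `code` as text so that either a string or a numeric 200 is accepted.

```python
def extract_task_id(response: dict) -> str:
    """Return data.task_id from a submission response, or raise if it failed."""
    ok = (
        str(response.get("code")) == "200"
        and response.get("success") is True
        and not response.get("failed")
    )
    if not ok:
        raise RuntimeError(f"submission failed: {response.get('msg')!r}")
    task_id = (response.get("data") or {}).get("task_id")
    if not task_id:
        raise RuntimeError("response has no data.task_id")
    return task_id
```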

Callback Response Parameters

When the multi-speaker identification is complete, a callback will be sent to your configured webhook URL.
Parameter | Type | Description
code | string | Response code (200 indicates success)
message | string | Status message (e.g., "SUCCESS")
data | object | Response data object containing the multimodal ASR result and metadata
data.data | object | Inner data object containing transcription, faces, and usage information
data.data.audio_transcription | array | Array of transcription segments with speaker identification
data.data.audio_transcription[].start_time | number | Start time of the segment in seconds
data.data.audio_transcription[].end_time | number | End time of the segment in seconds
data.data.audio_transcription[].speaker | string | Identified speaker name
data.data.audio_transcription[].text | string | Transcription text for this segment
data.data.faces | array | Array of detected faces with metadata
data.data.faces[].face_id | string | Unique identifier for the detected face
data.data.faces[].name | string | Identified name of the person
data.data.faces[].face_file_protocol | string | Storage protocol (e.g., "gs" for Google Cloud Storage)
data.data.faces[].face_file_bucket | string | Storage bucket name
data.data.faces[].face_file_blob | string | File path in the storage bucket
data.data.usage_metadata | array | Array of usage statistics for different models used
data.data.usage_metadata[].duration | number | Processing duration in seconds
data.data.usage_metadata[].model | string | The AI model used (e.g., "gemini-2.5-pro", "gemini-2.5-flash")
data.data.usage_metadata[].output_tokens | integer | Number of tokens in the output
data.data.usage_metadata[].prompt_tokens | integer | Number of tokens in the input prompt
data.msg | string | Detailed message about the operation result
data.success | boolean | Indicates whether the multimodal ASR was successful
task_id | string | The task ID associated with this multi-speaker identification request
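
The nesting of the callback (`data.data.…`) is easy to get wrong, so the sketch below walks a payload shaped like the fields listed above. `SAMPLE_CALLBACK` is invented for illustration and is not real API output. Joining protocol, bucket, and blob into a `protocol://bucket/blob` URI is also an assumption; the page lists the three fields but not how to combine them.

```python
from collections import defaultdict

# Made-up payload that follows the documented field names; not real output.
SAMPLE_CALLBACK = {
    "code": "200",
    "message": "SUCCESS",
    "task_id": "ec2449885ba84c4f943a80ff0633158e",
    "data": {
        "msg": "ok",
        "success": True,
        "data": {
            "audio_transcription": [
                {"start_time": 0.0, "end_time": 2.5, "speaker": "Alice", "text": "Hello."},
                {"start_time": 2.5, "end_time": 4.0, "speaker": "Bob", "text": "Hi there."},
                {"start_time": 4.0, "end_time": 6.0, "speaker": "Alice", "text": "Shall we start?"},
            ],
            "faces": [
                {"face_id": "f1", "name": "Alice", "face_file_protocol": "gs",
                 "face_file_bucket": "example-bucket", "face_file_blob": "faces/f1.jpg"},
            ],
            "usage_metadata": [
                {"duration": 1.2, "model": "gemini-2.5-pro",
                 "output_tokens": 100, "prompt_tokens": 900},
                {"duration": 0.4, "model": "gemini-2.5-flash",
                 "output_tokens": 20, "prompt_tokens": 300},
            ],
        },
    },
}


def transcript_by_speaker(callback: dict) -> dict:
    """Group segment texts by speaker name, keeping segment order."""
    grouped = defaultdict(list)
    for segment in callback["data"]["data"].get("audio_transcription", []):
        grouped[segment["speaker"]].append(segment["text"])
    return dict(grouped)


def face_uris(callback: dict) -> dict:
    """Map each face name to a storage URI (the URI layout is an assumption)."""
    return {
        face["name"]: (
            f"{face['face_file_protocol']}://"
            f"{face['face_file_bucket']}/{face['face_file_blob']}"
        )
        for face in callback["data"]["data"].get("faces", [])
    }


def total_tokens(callback: dict) -> tuple:
    """Sum (prompt_tokens, output_tokens) across all usage_metadata entries."""
    usage = callback["data"]["data"].get("usage_metadata", [])
    return (
        sum(entry["prompt_tokens"] for entry in usage),
        sum(entry["output_tokens"] for entry in usage),
    )
```

If two detected faces share a name, `face_uris` keeps only the last one; keying by `face_id` avoids that.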

Authorizations

Name | Type | In | Required
Authorization | string | header | Yes

Body

Content type: application/json

Field | Type | Required | Description | Example
asset_id | string | Yes | The asset ID to identify multiple speakers for | "re_657929111888723968"

Response

200 - application/json: Multi-speaker identification task information

Field | Type | Description | Example
code | string | Response code indicating the result status | 200
msg | string | Response message describing the operation result | "success"
data | object | Response data object containing task information |
success | boolean | Indicates whether the operation was successful | true
failed | boolean | Indicates whether the operation failed | false