Uses the AssemblyAI Universal-2 model. Submits a transcription job, polls until completion, and returns the full result in a single response.
Pricing: $0.15 per hour of audio, billed by actual audio duration (in seconds).
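Since billing is by actual duration, the per-request cost is a straightforward proration of the hourly rate (`transcription_cost` is an illustrative helper, not part of the API):

```python
def transcription_cost(audio_duration_seconds: int, rate_per_hour: float = 0.15) -> float:
    """Cost in USD for a clip billed by its actual duration in seconds."""
    return audio_duration_seconds / 3600 * rate_per_hour

# The 52-second clip in the sample response below costs about $0.0022.
print(round(transcription_cost(52), 4))  # 0.0022
```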
## Audio Source

You must provide one of the following (priority: `asset_id` > `url` > `source_url`).

## Parameters

| Parameter | Description |
|---|---|
| `Authorization` (header) | API key for authentication (e.g. `sk-mai-xxx`). |
| `provider` | STT provider. Must be `assemblyai`. |
| `asset_id` | The unique identifier of an uploaded audio/video asset (e.g. `re_xxx`). Resolved to a signed GCS URL. |
| `url` | A publicly accessible audio URL. |
| `source_url` | A `gs://` GCS path or public HTTP URL. GCS paths are converted to signed URLs automatically. |
| `language_code` | Language code (ISO 639-1, e.g. `en`, `zh`). If omitted, the provider auto-detects the language. |
| `punctuate` | Automatically add punctuation and casing. |
| `format_text` | Format numbers, dates, etc. |
| `speaker_labels` | Enable speaker diarization. |
| `speakers_expected` | Expected number of speakers. |
| `language_detection` | Enable automatic language detection. |
| `language_confidence_threshold` | Confidence threshold for language detection (0.0–1.0). |
| `speech_model` | Speech recognition model to use. |
| `speech_threshold` | Speech confidence threshold (0.0–1.0). |
| `disfluencies` | Include disfluencies (um, uh, etc.). |
| `sentiment_analysis` | Enable sentiment analysis per utterance. |
| `entity_detection` | Enable entity detection (names, locations, etc.). |
| `auto_highlights` | Automatically highlight key phrases. |
| `content_safety` | Enable content safety detection. |
| `iab_categories` | Enable IAB topic categorization. |
| `auto_chapters` | Automatically generate chapters. |
| `summary_model` | Summarization model: `informative` or `conversational`. |
| `summary_type` | Summary format: `bullets`, `bullets_verbose`, `headline`, `paragraph`, or `gist`. |
| `redact_pii_policies` | PII types to redact (e.g. `email_address`, `phone_number`, `person_name`). |
| `redact_pii_sub` | PII replacement strategy: `hash` or `entity_name`. |
| `redact_pii_audio` | Redact PII from audio output. |
| `redact_pii_audio_quality` | Redacted audio quality: `mp3` or `wav`. |
| `filter_profanity` | Filter profanity from transcript. |
| `word_boost` | List of words to boost recognition. |
| `boost_param` | Boost strength: `low`, `default`, or `high`. |
| `custom_spelling` | Custom spelling corrections. |
| `webhook_url` | AssemblyAI webhook callback URL. |
| `multichannel` | Enable multi-channel transcription. |
| `audio_start_from` | Start transcription from this time (milliseconds). |
| `audio_end_at` | End transcription at this time (milliseconds). |
| `custom_topics` | Enable custom topic detection. |
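The audio-source priority described above amounts to a first-match rule. A minimal sketch (`resolve_audio_source` is a hypothetical helper, not part of the API):

```python
def resolve_audio_source(asset_id=None, url=None, source_url=None):
    """Return the (field, value) pair the API would use,
    honoring the priority asset_id > url > source_url."""
    for field, value in (("asset_id", asset_id), ("url", url), ("source_url", source_url)):
        if value:
            return field, value
    raise ValueError("one of asset_id, url, or source_url is required")
```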
## Code Examples
```bash
curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text \
  --header 'Authorization: sk-mai-this_a_test_string_please_use_your_generated_key_during_testing' \
  --header 'Content-Type: application/json' \
  --data '{
    "provider": "assemblyai",
    "asset_id": "re_657929111888723968",
    "language_code": "en",
    "speaker_labels": true,
    "speakers_expected": 2,
    "punctuate": true,
    "format_text": true
  }'
```
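The same request in Python using only the standard library (a sketch: `build_payload` and `transcribe` are illustrative helpers, and since the endpoint blocks until the job completes, the timeout is deliberately generous):

```python
import json
import urllib.request

API_URL = "https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text"

def build_payload(asset_id: str, speakers: int = 2) -> dict:
    """Request body matching the curl example above."""
    return {
        "provider": "assemblyai",
        "asset_id": asset_id,
        "language_code": "en",
        "speaker_labels": True,
        "speakers_expected": speakers,
        "punctuate": True,
        "format_text": True,
    }

def transcribe(api_key: str, asset_id: str) -> dict:
    """POST the job and return the completed transcript (the `data` object)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(asset_id)).encode(),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=600) as resp:  # blocks until the job finishes
        return json.load(resp)["data"]
```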
## Response

```json
{
  "code": 200,
  "msg": "success",
  "data": {
    "id": "9a27d0d5-d2db-448c-823c-f098507789be",
    "status": "completed",
    "language_code": "en_us",
    "audio_url": "https://storage.googleapis.com/...",
    "audio_duration": 52,
    "text": "Hello, how are you today? I'm doing well, thank you.",
    "words": [
      {
        "text": "Hello,",
        "start": 0,
        "end": 520,
        "confidence": 0.99,
        "speaker": "A"
      },
      {
        "text": "how",
        "start": 520,
        "end": 780,
        "confidence": 0.98,
        "speaker": "A"
      }
    ],
    "utterances": [
      {
        "confidence": 0.97,
        "start": 0,
        "end": 2980,
        "text": "Hello, how are you today?",
        "speaker": "A"
      },
      {
        "confidence": 0.95,
        "start": 2980,
        "end": 5200,
        "text": "I'm doing well, thank you.",
        "speaker": "B"
      }
    ],
    "confidence": 0.97,
    "punctuate": true,
    "format_text": true,
    "speaker_labels": true,
    "speakers_expected": 2
  },
  "failed": false,
  "success": true
}
```
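When `speaker_labels` is enabled, the `utterances` array makes per-speaker statistics easy to derive. For example, total talk time per speaker (using the utterance timings from the sample response; `talk_time_by_speaker` is an illustrative helper):

```python
def talk_time_by_speaker(utterances: list) -> dict:
    """Seconds each speaker spent talking, from utterance start/end times (ms)."""
    totals = {}
    for u in utterances:
        totals[u["speaker"]] = totals.get(u["speaker"], 0.0) + (u["end"] - u["start"]) / 1000
    return totals

sample = [
    {"speaker": "A", "start": 0, "end": 2980, "text": "Hello, how are you today?"},
    {"speaker": "B", "start": 2980, "end": 5200, "text": "I'm doing well, thank you."},
]
print(talk_time_by_speaker(sample))  # {'A': 2.98, 'B': 2.22}
```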
## Response Parameters

| Parameter | Type | Description |
|---|---|---|
| data.id | string | AssemblyAI transcript ID |
| data.status | string | Transcript status; `completed` on success, since the endpoint polls until the job finishes |
| data.language_code | string | Detected language code |
| data.audio_duration | integer | Audio duration in seconds |
| data.text | string | Full transcription text |
| data.confidence | number | Overall transcription confidence (0.0–1.0) |
| data.words | array[object] | Word-level transcription with timing (milliseconds) |
| data.words[].text | string | The transcribed word |
| data.words[].start | integer | Start time in milliseconds |
| data.words[].end | integer | End time in milliseconds |
| data.words[].confidence | number | Word confidence score |
| data.words[].speaker | string | Speaker label (e.g. A, B). Only present when speaker_labels=true. |
| data.utterances | array[object] | Sentence-level segments (only when speaker_labels=true) |
| data.utterances[].text | string | Utterance text |
| data.utterances[].start | integer | Start time in milliseconds |
| data.utterances[].end | integer | End time in milliseconds |
| data.utterances[].confidence | number | Utterance confidence score |
| data.utterances[].speaker | string | Speaker label |
Timestamps are in milliseconds (e.g. 520).
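Because every timestamp is in milliseconds, turning utterances into subtitle cues is purely a formatting exercise. A sketch that renders the `utterances` array as SRT (`ms_to_srt` and `utterances_to_srt` are illustrative helpers):

```python
def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, e.g. 520 -> 00:00:00,520."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, millis = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{millis:03}"

def utterances_to_srt(utterances: list) -> str:
    """Render utterance segments as numbered SRT cue blocks."""
    blocks = []
    for i, u in enumerate(utterances, start=1):
        blocks.append(f"{i}\n{ms_to_srt(u['start'])} --> {ms_to_srt(u['end'])}\n{u['text']}")
    return "\n\n".join(blocks)
```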