AssemblyAI

POST /serve/api/v2/transcriptions/speech-to-text
curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "provider": "<string>",
  "asset_id": "<string>",
  "url": "<string>",
  "source_url": "<string>",
  "language_code": "<string>",
  "punctuate": true,
  "format_text": true,
  "speaker_labels": true,
  "speakers_expected": 123,
  "language_detection": true,
  "language_confidence_threshold": 123,
  "speech_model": "<string>",
  "speech_threshold": 123,
  "disfluencies": true,
  "sentiment_analysis": true,
  "entity_detection": true,
  "auto_highlights": true,
  "content_safety": true,
  "iab_categories": true,
  "auto_chapters": true,
  "summarization": true,
  "summary_model": "<string>",
  "summary_type": "<string>",
  "redact_pii": true,
  "redact_pii_policies": [
    "<string>"
  ],
  "redact_pii_sub": "<string>",
  "redact_pii_audio": true,
  "redact_pii_audio_quality": "<string>",
  "filter_profanity": true,
  "word_boost": [
    "<string>"
  ],
  "boost_param": "<string>",
  "custom_spelling": [
    {}
  ],
  "webhook_url": "<string>",
  "multichannel": true,
  "audio_start_from": 123,
  "audio_end_at": 123,
  "custom_topics": true,
  "topics": [
    "<string>"
  ]
}
'
Uses the AssemblyAI Universal-2 model. Submits a transcription job, polls until it completes, and returns the full result in a single response.
Pricing: $0.15 per hour of audio, billed by actual audio duration (in seconds).
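Because billing is by actual duration, the cost of a clip is a simple linear function of its length. A quick estimate can be sketched like this (illustrative only; the $0.15/hour rate is the figure quoted above):

```python
# Estimate transcription cost from audio duration in seconds.
# Rate: $0.15 per hour of audio, billed by actual duration.
RATE_PER_HOUR = 0.15

def estimate_cost(duration_seconds: int) -> float:
    """Return the estimated charge in USD for a clip of the given length."""
    return duration_seconds / 3600 * RATE_PER_HOUR

# A 52-second clip (the audio_duration in the sample response):
print(round(estimate_cost(52), 4))  # → 0.0022
```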

Audio Source

You must provide one of the following (priority: asset_id > url > source_url).
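The priority rule can be sketched as a small helper (illustrative only; the field names match the request body, but the function itself is hypothetical and not part of the API):

```python
def pick_audio_source(body: dict) -> tuple[str, str]:
    """Return (field, value) for the audio source that takes effect,
    following the documented priority: asset_id > url > source_url."""
    for field in ("asset_id", "url", "source_url"):
        value = body.get(field)
        if value:
            return field, value
    raise ValueError("one of asset_id, url, or source_url is required")

# If several are set, asset_id wins:
print(pick_audio_source({"url": "https://example.com/a.mp3",
                         "asset_id": "re_657929111888723968"}))
# → ('asset_id', 're_657929111888723968')
```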

Parameters

Authorization
string
required
API key for authentication (e.g. sk-mai-xxx).
provider
string
required
STT provider. Must be assemblyai.
asset_id
string
The unique identifier of an uploaded audio/video asset (e.g. re_xxx). Resolved to a signed GCS URL.
url
string
A publicly accessible audio URL.
source_url
string
A gs:// GCS path or public HTTP URL. GCS paths are converted to signed URLs automatically.
language_code
string
Language code (ISO 639-1, e.g. en, zh). If omitted, the provider auto-detects the language.
punctuate
boolean
default: true
Add punctuation.
format_text
boolean
default: true
Format numbers, dates, etc.
speaker_labels
boolean
Enable speaker diarization.
speakers_expected
integer
Expected number of speakers.
language_detection
boolean
Enable automatic language detection.
language_confidence_threshold
number
Confidence threshold for language detection (0.0–1.0).
speech_model
string
Speech recognition model to use.
speech_threshold
number
Speech confidence threshold (0.0–1.0).
disfluencies
boolean
Include disfluencies (um, uh, etc.).
sentiment_analysis
boolean
Enable sentiment analysis per utterance.
entity_detection
boolean
Enable entity detection (names, locations, etc.).
auto_highlights
boolean
Automatically highlight key phrases.
content_safety
boolean
Enable content safety detection.
iab_categories
boolean
Enable IAB topic categorization.
auto_chapters
boolean
Automatically generate chapters.
summarization
boolean
Enable summarization.
summary_model
string
Summarization model: informative or conversational.
summary_type
string
Summary format: bullets, bullets_verbose, headline, paragraph, or gist.
redact_pii
boolean
Enable PII redaction.
redact_pii_policies
string[]
PII types to redact (e.g. email_address, phone_number, person_name).
redact_pii_sub
string
PII replacement strategy: hash or entity_name.
redact_pii_audio
boolean
Redact PII from audio output.
redact_pii_audio_quality
string
Redacted audio quality: mp3 or wav.
filter_profanity
boolean
Filter profanity from transcript.
word_boost
string[]
List of words to boost recognition.
boost_param
string
Boost strength: low, default, or high.
custom_spelling
object[]
Custom spelling corrections.
webhook_url
string
AssemblyAI webhook callback URL.
multichannel
boolean
Enable multi-channel transcription.
audio_start_from
integer
Start transcription from this time (milliseconds).
audio_end_at
integer
End transcription at this time (milliseconds).
custom_topics
boolean
Enable custom topic detection.
topics
string[]
Custom topic labels.

Code Examples

curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text \
  --header 'Authorization: sk-mai-this_a_test_string_please_use_your_generated_key_during_testing' \
  --header 'Content-Type: application/json' \
  --data '{
    "provider": "assemblyai",
    "asset_id": "re_657929111888723968",
    "language_code": "en",
    "speaker_labels": true,
    "speakers_expected": 2,
    "punctuate": true,
    "format_text": true
  }'
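
The same request in Python, using only the standard library (a sketch; the key and asset ID are the placeholder values from the curl example above):

```python
import json
import urllib.request

ENDPOINT = "https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text"

def build_request(api_key: str, asset_id: str, **options) -> urllib.request.Request:
    """Assemble a POST request for the speech-to-text endpoint."""
    body = {"provider": "assemblyai", "asset_id": asset_id, **options}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode(),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def send(req: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON body. The endpoint blocks
    until transcription completes, so allow a generous timeout."""
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)

req = build_request(
    "sk-mai-this_a_test_string_please_use_your_generated_key_during_testing",
    "re_657929111888723968",
    language_code="en",
    speaker_labels=True,
    speakers_expected=2,
)
# result = send(req)   # e.g. result["data"]["text"]
```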

Response

{
  "code": 200,
  "msg": "success",
  "data": {
    "id": "9a27d0d5-d2db-448c-823c-f098507789be",
    "status": "completed",
    "language_code": "en_us",
    "audio_url": "https://storage.googleapis.com/...",
    "audio_duration": 52,
    "text": "Hello, how are you today? I'm doing well, thank you.",
    "words": [
      {
        "text": "Hello,",
        "start": 0,
        "end": 520,
        "confidence": 0.99,
        "speaker": "A"
      },
      {
        "text": "how",
        "start": 520,
        "end": 780,
        "confidence": 0.98,
        "speaker": "A"
      }
    ],
    "utterances": [
      {
        "confidence": 0.97,
        "start": 0,
        "end": 2980,
        "text": "Hello, how are you today?",
        "speaker": "A"
      },
      {
        "confidence": 0.95,
        "start": 2980,
        "end": 5200,
        "text": "I'm doing well, thank you.",
        "speaker": "B"
      }
    ],
    "confidence": 0.97,
    "punctuate": true,
    "format_text": true,
    "speaker_labels": true,
    "speakers_expected": 2
  },
  "failed": false,
  "success": true
}

Response Parameters

Parameter | Type | Description
data.id | string | AssemblyAI transcript ID
data.status | string | Transcript status: completed
data.language_code | string | Detected language code
data.audio_duration | integer | Audio duration in seconds
data.text | string | Full transcription text
data.confidence | number | Overall transcription confidence (0.0–1.0)
data.words | array[object] | Word-level transcription with timing (milliseconds)
data.words[].text | string | The transcribed word
data.words[].start | integer | Start time in milliseconds
data.words[].end | integer | End time in milliseconds
data.words[].confidence | number | Word confidence score
data.words[].speaker | string | Speaker label (e.g. A, B). Only present when speaker_labels=true.
data.utterances | array[object] | Sentence-level segments (only when speaker_labels=true)
data.utterances[].text | string | Utterance text
data.utterances[].start | integer | Start time in milliseconds
data.utterances[].end | integer | End time in milliseconds
data.utterances[].confidence | number | Utterance confidence score
data.utterances[].speaker | string | Speaker label
Timestamps are in milliseconds (e.g. 520).
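
Because every timestamp is a millisecond offset, turning utterances into caption-style lines takes only a small formatter (a sketch over the sample response above):

```python
def ms_to_clock(ms: int) -> str:
    """Format a millisecond offset as H:MM:SS.mmm."""
    s, milli = divmod(ms, 1000)
    m, sec = divmod(s, 60)
    h, minute = divmod(m, 60)
    return f"{h}:{minute:02d}:{sec:02d}.{milli:03d}"

# Utterances taken from the sample response:
utterances = [
    {"start": 0, "end": 2980, "speaker": "A", "text": "Hello, how are you today?"},
    {"start": 2980, "end": 5200, "speaker": "B", "text": "I'm doing well, thank you."},
]

for u in utterances:
    print(f"[{ms_to_clock(u['start'])} - {ms_to_clock(u['end'])}] "
          f"Speaker {u['speaker']}: {u['text']}")
# → [0:00:00.000 - 0:00:02.980] Speaker A: Hello, how are you today?
# → [0:00:02.980 - 0:00:05.200] Speaker B: I'm doing well, thank you.
```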