POST /serve/api/v2/transcriptions/speech-to-text
ElevenLabs
curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "provider": "<string>",
  "asset_id": "<string>",
  "url": "<string>",
  "source_url": "<string>",
  "language_code": "<string>",
  "model_id": "<string>",
  "diarize": true,
  "timestamps_granularity": "<string>",
  "tag_audio_events": true,
  "num_speakers": 123,
  "file_format": "<string>",
  "source_lang": "<string>",
  "target_lang": "<string>"
}
'
Uses the ElevenLabs Scribe V2 model and returns results synchronously.
Pricing: $0.39/hour of audio, billed by actual audio duration (in seconds).
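
As a rough illustration of per-second billing, the sketch below prorates the hourly rate over a clip's duration. The function name `estimate_cost_usd` and the rounding to four decimal places are assumptions made for this example; the actual rounding applied on an invoice is not stated here.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_HOUR_USD = Decimal("0.39")  # rate quoted above
SECONDS_PER_HOUR = Decimal(3600)


def estimate_cost_usd(duration_seconds: float) -> Decimal:
    """Estimate the charge for an audio clip of the given length in seconds."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Prorate the hourly price by the actual duration in seconds.
    cost = PRICE_PER_HOUR_USD * Decimal(str(duration_seconds)) / SECONDS_PER_HOUR
    # Rounding to 4 decimal places is an assumption, used for display only.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
```

For example, `estimate_cost_usd(1800)` estimates the charge for a 30-minute recording.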

Audio Source

You must provide one of asset_id, url, or source_url. If more than one is supplied, they are resolved in priority order: asset_id > url > source_url.
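
To see which source would take effect before sending a request, a client can mirror the stated priority locally. This is only a sketch: `effective_audio_source` is a hypothetical helper, and treating empty strings as "not provided" is an assumption about how the server behaves.

```python
from typing import Optional, Tuple

# Priority order stated above: asset_id > url > source_url.
SOURCE_PRIORITY = ("asset_id", "url", "source_url")


def effective_audio_source(payload: dict) -> Optional[Tuple[str, str]]:
    """Return (field, value) for the source that takes priority, or None."""
    for field in SOURCE_PRIORITY:
        value = payload.get(field)
        if value:  # assumption: missing or empty values count as not provided
            return field, value
    return None
```

A return value of `None` means the payload has no audio source and the request should not be sent.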

Parameters

Authorization
string
required
API key for authentication (e.g. sk-mai-xxx).
provider
string
default:"elevenlabs"
STT provider. Use elevenlabs for this endpoint.
asset_id
string
The unique identifier of an uploaded audio/video asset (e.g. re_xxx). Resolved to a signed GCS URL.
url
string
A publicly accessible audio URL.
source_url
string
A gs:// GCS path or public HTTP URL. GCS paths are converted to signed URLs automatically.
language_code
string
Language code (ISO 639-1, e.g. en, zh). If omitted, the provider auto-detects the language.
model_id
string
default:"scribe_v2"
Model to use.
diarize
boolean
Enable speaker diarization.
timestamps_granularity
string
Timestamp level: none, segment, or word.
tag_audio_events
boolean
Tag audio events such as music, laughter, applause.
num_speakers
integer
Expected number of speakers (improves diarization).
file_format
string
Audio format hint (e.g. pcm_s16le_16000).
source_lang
string
Source language for translation.
target_lang
string
Target language for translation.
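
The parameters above can be assembled into a request body as in the following minimal sketch, which uses only the Python standard library. `build_stt_request` is a hypothetical helper, not part of any official SDK. The client-side checks (one audio source, a known timestamps_granularity value) are assumptions drawn from the descriptions above; the server may validate differently.

```python
import json
import urllib.request

ENDPOINT = "https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text"
GRANULARITIES = {"none", "segment", "word"}  # values listed above


def build_stt_request(api_key: str, **params) -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech-to-text endpoint."""
    # Drop unset optional parameters so that server-side defaults apply.
    body = {k: v for k, v in params.items() if v is not None}
    body.setdefault("provider", "elevenlabs")
    if not any(body.get(k) for k in ("asset_id", "url", "source_url")):
        raise ValueError("provide one of asset_id, url, or source_url")
    granularity = body.get("timestamps_granularity")
    if granularity is not None and granularity not in GRANULARITIES:
        raise ValueError(f"unsupported timestamps_granularity: {granularity!r}")
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen`, and the response body decoded with `json.load`. Because the endpoint is synchronous, a generous timeout is probably sensible for long recordings.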

Code Examples

curl --request POST \
  --url https://mavi-backend.memories.ai/serve/api/v2/transcriptions/speech-to-text \
  --header 'Authorization: sk-mai-this_a_test_string_please_use_your_generated_key_during_testing' \
  --header 'Content-Type: application/json' \
  --data '{
    "provider": "elevenlabs",
    "asset_id": "re_657929111888723968",
    "model_id": "scribe_v2",
    "language_code": "en",
    "diarize": true,
    "timestamps_granularity": "word",
    "num_speakers": 2
  }'

Response

{
  "code": 200,
  "msg": "success",
  "data": {
    "language_code": "en",
    "language_probability": 0.98,
    "text": "Hello, how are you today? I'm doing well, thank you.",
    "words": [
      {
        "text": "Hello,",
        "start": 0.0,
        "end": 0.52,
        "type": "word",
        "speaker_id": "speaker_0"
      },
      {
        "text": " ",
        "start": 0.52,
        "end": 0.52,
        "type": "spacing"
      },
      {
        "text": "how",
        "start": 0.52,
        "end": 0.78,
        "type": "word",
        "speaker_id": "speaker_0"
      }
    ]
  },
  "failed": false,
  "success": true
}
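
When diarize is enabled, the `words` array can be folded into speaker turns. The sketch below is one possible approach, not a documented feature: `speaker_turns` is a hypothetical helper, and it assumes spacing tokens carry no `speaker_id` (as in the sample above) and belong to the turn in progress. Tokens without a `speaker_id` are grouped under the placeholder label "unknown".

```python
from typing import Dict, List


def speaker_turns(words: List[dict]) -> List[Dict[str, object]]:
    """Merge consecutive tokens from the same speaker into one turn."""
    turns: List[Dict[str, object]] = []
    for token in words:
        if token.get("type") == "spacing":
            # Spacing tokens have no speaker; append them to the current turn.
            if turns:
                turns[-1]["text"] += token["text"]
            continue
        speaker = token.get("speaker_id", "unknown")
        if turns and turns[-1]["speaker_id"] == speaker:
            # Same speaker continues: extend the text and the end time.
            turns[-1]["text"] += token["text"]
            turns[-1]["end"] = token["end"]
        else:
            turns.append({
                "speaker_id": speaker,
                "start": token["start"],
                "end": token["end"],
                "text": token["text"],
            })
    for turn in turns:
        turn["text"] = turn["text"].strip()  # drop spacing left at turn edges
    return turns
```

Called as `speaker_turns(response["data"]["words"])`, it returns one dictionary per turn with `speaker_id`, `start`, `end`, and `text`.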

Response Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| data.language_code | string | Detected language code (ISO 639-1) |
| data.language_probability | number | Confidence of language detection (0.0–1.0) |
| data.text | string | Full transcription text |
| data.words | array[object] | Word-level transcription with timing |
| data.words[].text | string | The word or spacing text |
| data.words[].start | number | Start time in seconds |
| data.words[].end | number | End time in seconds |
| data.words[].type | string | Token type: word, spacing, or audio_event |
| data.words[].speaker_id | string | Speaker identifier (e.g. speaker_0). Only present when diarize=true. |
Timestamps are in seconds (e.g. 0.52).
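
Since `start` and `end` are fractional seconds, a client that needs clock-style times (for captions, say) has to convert them. The sketch below shows one way to do that. `format_timestamp` and `word_timings` are hypothetical helper names, and the `HH:MM:SS.mmm` layout is an arbitrary choice for this example.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to an HH:MM:SS.mmm string."""
    total_ms = round(seconds * 1000)  # work in whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def word_timings(words: List[dict]) -> List[Tuple[str, str, str]]:
    """Return (start, end, text) for tokens of type "word", skipping the rest."""
    return [
        (format_timestamp(w["start"]), format_timestamp(w["end"]), w["text"])
        for w in words
        if w.get("type") == "word"
    ]
```

Spacing and audio_event tokens are filtered out here; drop the `type` check to keep them.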