Product: Visual Search
Use case: Upload videos and images, auto-index them, then search by natural language, image, or transcript phrase
Host: https://api.memories.ai/serve/api/v1
Auth: Authorization: sk-mavi-... (no Bearer prefix)
Retrieve the visual caption of a video — a chronological list of scene descriptions produced by the indexing pipeline. Each segment carries a time range and a natural-language description of what is happening on screen during that window. For the spoken-words transcription, use Get Audio Transcription.
{
  "code": "0000",
  "msg": "success",
  "data": {
    "videoNo": "VI702915390254350336",
    "transcriptions": [
      {
        "index": 0,
        "content": "A person's hand is seen holding a white object, possibly a phone, near a wooden door. The camera pans slightly to reveal a wall with coats hanging on a rack.",
        "startTime": "0",
        "endTime": "4"
      },
      {
        "index": 1,
        "content": "The camera moves past a wooden door, revealing a hallway with another door at the end. A laundry basket is visible on the right.",
        "startTime": "4",
        "endTime": "7"
      }
    ],
    "createTime": "1777047288621",
    "video_bucket": "mavi-resource",
    "video_blob": "VI702915390254350336.mp4"
  },
  "success": true,
  "failed": false
}
Availability: data.transcriptions is populated by the indexing pipeline. If the video has not yet reached status: PARSE, the call may return data: null or an empty transcriptions array — poll Get Metadata until parsing completes before depending on the result.
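The polling advice above could be sketched as follows. This is a minimal illustration, not the official client: `wait_until_parsed` and its `fetch_status` argument are hypothetical names, and `fetch_status` stands in for whatever code calls Get Metadata and returns the video's status string, since that endpoint's request and response shape are not shown on this page.

```python
import time
from typing import Callable


def wait_until_parsed(
    fetch_status: Callable[[], str],  # hypothetical: wraps your Get Metadata call
    ready_status: str = "PARSE",      # status value named in the note above
    interval_s: float = 2.0,          # assumed polling interval, tune to your rate limits
    max_attempts: int = 30,           # assumed upper bound on polls
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once fetch_status() reports ready_status, False if attempts run out."""
    for attempt in range(max_attempts):
        if fetch_status() == ready_status:
            return True
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False
```

Passing the status lookup in as a callable keeps the loop independent of any particular HTTP library and makes it easy to exercise without network access.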
Numeric strings: startTime, endTime, and createTime are strings — cast with int(...) before arithmetic.
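As a sketch of the cast described above, the helper below walks a decoded response body and converts each segment's time fields to integers. The function name `caption_segments` and the output keys are illustrative assumptions; only the input field names come from the sample response. It also tolerates `data: null` and an empty or missing `transcriptions` array, as described in the availability note.

```python
from typing import Any


def caption_segments(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return caption segments with integer times; an empty list if nothing is indexed yet."""
    data = response.get("data") or {}  # data may be null before parsing completes
    segments = []
    for seg in data.get("transcriptions") or []:
        start = int(seg["startTime"])  # times arrive as strings such as "4"
        end = int(seg["endTime"])
        segments.append(
            {
                "index": seg["index"],
                "start": start,
                "end": end,
                "duration": end - start,  # same unit as startTime and endTime
                "content": seg["content"],
            }
        )
    return segments


# Abbreviated version of the sample response above.
sample = {
    "code": "0000",
    "data": {
        "videoNo": "VI702915390254350336",
        "transcriptions": [
            {"index": 0, "content": "A person's hand ...", "startTime": "0", "endTime": "4"},
            {"index": 1, "content": "The camera moves ...", "startTime": "4", "endTime": "7"},
        ],
        "createTime": "1777047288621",
    },
}

for segment in caption_segments(sample):
    print(segment["index"], segment["start"], segment["end"], segment["duration"])
```

For the sample above, the two segments have durations of 4 and 3 respectively.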
Rate limiting: Subject to the standard Visual Search rate limits. See Rate limits.