ZiB Network

Vision AI

ZiB Vision AI performs deep scene analysis on your video files. The output is a ZibSidecar JSON — a single structured document containing eight top-level sections: video properties, transcript, subtitles, chapters, scene understanding, content classification, clip suggestions, and thumbnail candidates.

The ZibSidecar is encrypted and stored redundantly on the ZiB network alongside your video. Retrieve it via CDN URL using the sidecar_file_id from the job status response — it is a standard ZiB object, decrypted on-the-fly by the CDN.

Sections: 8 top-level blocks
Tiers: 2 (standard · hq)
Standards: IAB + GARM (industry-standard)
Format: Encrypted JSON on the ZiB network

The compute node decrypts the video in-memory only for the duration of inference. The AES key is never written to disk. The ZibSidecar output is encrypted with a new key before being stored. Node operators cannot access the video content or the AI analysis.

Tiers

Tier               | Speed   | Quality
standard (default) | Default | Good — scene description, tags, summary, clip suggestions
hq                 | Slower  | High — richer detail, better accuracy on complex visual content

The standard tier is recommended for the majority of use cases. Use hq when richer scene descriptions and higher accuracy on complex visual content justify the additional cost.

Quick Start

Submit a Vision AI job for a file already uploaded to ZiB, poll until complete, then fetch the sidecar as plain JSON.

javascript
// Submit vision AI job
const { job_id } = await zib.submitVisionAI(objectId, 'standard');

// Poll until complete (typically 1–5 min depending on video length)
let status;
do {
  await new Promise(r => setTimeout(r, 3000));
  status = await zib.getComputeStatus(job_id);
} while (status.status !== 'complete' && status.status !== 'failed');

if (status.status === 'failed') {
  throw new Error(status.error_message);
}

// Fetch the sidecar — standard ZiB CDN URL, no special auth needed
const sidecarUrl = zib.getCdnUrl(status.sidecar_file_id);
const sidecar = await fetch(sidecarUrl).then(r => r.json());
console.log(sidecar.scene_understanding.title_suggestion);

Raw HTTP

http
POST /v1/api/compute/{objectId}
Authorization: Bearer ACCESS_KEY:SECRET_KEY
Content-Type: application/json

{
  "vision": "standard"
}
The sidecar is fetched like any other ZiB file — via getCdnUrl(sidecar_file_id). The CDN decrypts on-the-fly and returns plain JSON. No special headers or authentication required.

Analysis Stages

Vision AI runs multiple analysis stages, each contributing specific fields to the ZibSidecar.

01
Per-Keyframe Deep Analysis (every 30s by default, configurable)

Each keyframe is analysed independently. Produces: description, setting, presenter_emotion, body_language, engagement_level, on_screen_text, objects, brands_visible, production_markers, lighting_quality, framing_quality, face_visible, face_centered, thumbnail_score (0–1). This is the raw material for the engagement map and thumbnail candidates.
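These per-frame fields are the raw inputs for downstream ranking. As an illustration, a consumer could re-rank frames itself using thumbnail_score and face_visible — a sketch only, since the sidecar already ships a ranked thumbnail_candidates list:

```javascript
// Sketch: re-rank Pass 1 keyframes for thumbnail use.
// Assumes the keyframe shape documented above; illustrative only.
function rankThumbnails(keyframes) {
  return [...keyframes]
    .filter(k => k.face_visible)                            // faces only
    .sort((a, b) => b.thumbnail_score - a.thumbnail_score); // best first
}

const frames = [
  { timestamp_s: 0.0,  face_visible: true,  thumbnail_score: 0.91 },
  { timestamp_s: 30.0, face_visible: false, thumbnail_score: 0.95 },
  { timestamp_s: 60.0, face_visible: true,  thumbnail_score: 0.72 },
];
console.log(rankThumbnails(frames)[0].timestamp_s); // 0
```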

02
Per-Segment Intent Analysis (once per transcript segment)

Each transcript segment is paired with its nearest keyframe. Produces: segment_intent (hook / payoff / call_to_action / explaining / demonstrating / etc.), viewer_engagement_prediction, is_hook_moment, is_payoff_moment, cta_detected, cta_text, question_posed, key_claim. These fields are merged into each transcript.segments[] entry.

03
Whole-Video Synthesis (once, after Passes 1 and 2)

Given all keyframe analyses and the full transcript, the model produces the executive_summary (150–200 words), title_suggestion, description_suggestion, content_format, narrative_arc, target_audience, assumed_knowledge_level, key_topics, unique_value, series_indicators, and monetisation_notes. GARM brand safety scoring also runs at this stage.

ZibSidecar JSON Reference

The complete schema. Every field present in the canonical sidecar is documented below, grouped by top-level section. All timestamp fields appear in both _s (float seconds) and _ms (integer milliseconds) — no consumer ever needs to convert.

Identity & metadata

zib_sidecar_version (string): Always "1.0" for this schema version.
video_object_id (string): ZiB object ID of the source video.
generated_at (string): ISO 8601 timestamp of when analysis completed.
processing_time_ms (integer): Total inference time in milliseconds.
models_used.transcription (string): Tier used for transcription, e.g. "recommended".
models_used.vision (string): Tier used for vision, e.g. "standard".

video — technical properties

video.duration_s (float): Duration in seconds.
video.duration_ms (integer): Duration in milliseconds.
video.width (integer): Video width in pixels.
video.height (integer): Video height in pixels.
video.aspect_ratio (string): Normalised ratio string, e.g. "16:9".
video.fps (float): Frame rate, e.g. 29.97.
video.codec (string): Video codec, e.g. "h264".
video.audio_channels (integer): Number of audio channels (1 = mono, 2 = stereo).
video.audio_sample_rate (integer): Audio sample rate in Hz, e.g. 44100.
video.file_size_bytes (integer): Source file size in bytes.
video.bitrate_kbps (integer): Overall bitrate in kbps.

transcript — speech-to-text output

The full speech-to-text output. Every segment entry also contains the intent fields merged in-place.

transcript.language (string): Detected language code, e.g. "en".
transcript.language_confidence (float): Language detection confidence 0–1.
transcript.full_text (string): Complete transcript as a single string.
transcript.word_count (integer): Total word count.
transcript.speaking_pace_wpm (integer): Words per minute of spoken content.
transcript.silence_ratio (float): Proportion of video with no speech (0–1).
transcript.multiple_speakers (boolean): Whether more than one speaker was detected.
transcript.segments (array): Timestamped segments. Each entry below:
segments[].index (integer): Zero-based segment index.
segments[].start_s (float): Segment start in seconds.
segments[].end_s (float): Segment end in seconds.
segments[].start_ms (integer): Segment start in milliseconds.
segments[].end_ms (integer): Segment end in milliseconds.
segments[].text (string): Transcript text for this segment.
segments[].speaker (string): "speaker_1", "speaker_2" etc. when multiple speakers detected.
segments[].confidence (float): Transcription confidence score 0–1.
segments[].intent (string): Pass 2: e.g. "hook", "explaining", "call_to_action", "payoff".
segments[].viewer_engagement_prediction (string): Pass 2: "high" | "medium" | "low" | "skip_risk".
segments[].is_hook_moment (boolean): Pass 2: would this segment work as a short-form clip opener?
segments[].is_payoff_moment (boolean): Pass 2: is this a key revelation or satisfying conclusion?
segments[].cta_detected (boolean): Pass 2: call-to-action present in this segment.
segments[].cta_text (string | null): Pass 2: CTA text if detected, null otherwise.
segments[].question_posed (string | null): Pass 2: question asked by presenter, null if none.
segments[].key_claim (string | null): Pass 2: most important claim in this segment, null if none.
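Because the Pass 2 annotations are merged into each segment, they can be pulled straight off transcript.segments. For example, collecting every detected call-to-action (a sketch assuming the segment shape documented above):

```javascript
// Sketch: list all CTAs with their timestamps, using the merged
// Pass 2 fields on transcript.segments.
function collectCtas(segments) {
  return segments
    .filter(s => s.cta_detected)
    .map(s => ({ at_ms: s.start_ms, text: s.cta_text }));
}

const segments = [
  { start_ms: 0,      cta_detected: false, cta_text: null },
  { start_ms: 585000, cta_detected: true,  cta_text: 'Link to GitHub in description' },
];
console.log(collectCtas(segments));
// [ { at_ms: 585000, text: 'Link to GitHub in description' } ]
```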

subtitles

subtitles.language (string): Language code, e.g. "en".
subtitles.srt_object_id (string): ZiB object ID of the stored SRT file.
subtitles.vtt_object_id (string): ZiB object ID of the stored WebVTT file.
subtitles.srt_inline (string): Complete SRT content as a string — ready to use without a separate fetch.
subtitles.vtt_inline (string): Complete WebVTT content as a string — ready to use without a separate fetch.
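Since srt_inline carries the complete file contents, a consumer can parse cues without fetching the separate SRT object. A minimal parsing sketch (real SRT has more edge cases than this handles):

```javascript
// Minimal SRT parser sketch — not a full implementation.
// Splits subtitles.srt_inline into { start, end, text } cue objects.
function parseSrt(srt) {
  return srt.trim().split(/\n{2,}/).map(block => {
    const lines = block.split('\n');
    const [start, end] = lines[1].split(' --> ');
    return { start, end, text: lines.slice(2).join('\n') };
  });
}

const cues = parseSrt('1\n00:00:00,000 --> 00:00:04,200\nWelcome back to the channel.\n');
console.log(cues[0].text); // "Welcome back to the channel."
```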

chapters — synthesised chapter markers

chapters (array): Synthesised chapter breaks from topic shifts and keyframe analysis.
chapters[].index (integer): Zero-based chapter index.
chapters[].title (string): AI-generated chapter title.
chapters[].summary (string): One-sentence description of chapter content.
chapters[].start_s (float): Chapter start in seconds.
chapters[].end_s (float): Chapter end in seconds.
chapters[].start_ms (integer): Chapter start in milliseconds.
chapters[].end_ms (integer): Chapter end in milliseconds.
chapters[].keywords (string[]): Key topics in this chapter.
chapters[].dominant_intent (string): Most common segment intent, e.g. "hook", "explaining".
chapters[].engagement_level (string): "high" | "medium" | "low" based on Pass 2 engagement predictions.
chapters_youtube_format (string): Ready-to-paste YouTube description chapter list, e.g. "0:00 Introduction\n2:22 How ZiB Works".
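The pre-built chapters_youtube_format string is derived from chapters[] roughly as follows (illustrative sketch; chapters under an hour shown as M:SS):

```javascript
// Sketch: rebuild a YouTube-style chapter list from chapters[].
// The sidecar already ships this pre-built as chapters_youtube_format.
function toYouTubeChapters(chapters) {
  return chapters
    .map(ch => {
      const total = Math.floor(ch.start_s);
      const m = Math.floor(total / 60);
      const s = String(total % 60).padStart(2, '0');
      return `${m}:${s} ${ch.title}`;
    })
    .join('\n');
}

const chapters = [
  { start_s: 0.0,   title: 'Introduction' },
  { start_s: 142.0, title: 'How ZiB Works' },
];
console.log(toYouTubeChapters(chapters)); // "0:00 Introduction\n2:22 How ZiB Works"
```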

scene_understanding — Pass 3 synthesis

The richest block — whole-video understanding from Pass 3, per-frame data from Pass 1, and the engagement map.

executive_summary (string): 150–200 word paragraph describing the video for search indexing and content discovery.
title_suggestion (string): SEO-optimised title, 60 characters max.
description_suggestion (string): 250–300 word platform description suitable for YouTube/RiteStream.
content_format (string): "tutorial" | "explainer" | "vlog" | "review" | "interview" | "podcast_clip" | "documentary" | "product_demo" | "news" | "entertainment" | "other".
narrative_arc (string): "problem_solution" | "question_answer" | "before_after" | "listicle" | "story" | "demonstration" | "none_apparent".
target_audience (string): Specific description of the intended audience.
assumed_knowledge_level (string): "beginner" | "intermediate" | "advanced" | "expert".
unique_value (string): What makes this video worth watching — what the viewer gets.
series_indicators (string | null): Evidence this is part of a series, null if standalone.
guest_present (boolean): Whether a guest speaker/interviewee is present.
monetisation_notes (string): Observations relevant to ad placement, sponsored segments, competitor mentions.

production block

production.quality (string): "broadcast" | "prosumer" | "consumer" | "ugc".
production.setting (string): "home_office" | "studio" | "outdoors" | "office" | "classroom" | "other".
production.cut_style (string): "fast" | "moderate" | "slow" | "static".
production.talking_head_ratio (float): Proportion of video showing a presenter face (0–1).
production.has_screen_recording (boolean): Screen recording detected in at least one keyframe.
production.has_broll (boolean): B-roll footage detected.
production.has_slides (boolean): Presentation slides detected.
production.has_lower_thirds (boolean): Lower-third graphics detected.
production.has_captions_burned_in (boolean): Hardcoded captions detected on screen.
production.has_graphic_overlays (boolean): Graphic overlays detected.
production.presenter_count (integer): Number of distinct presenters detected.
production.consistent_framing (boolean): Camera framing consistent across keyframes.
production.consistent_lighting (boolean): Lighting consistent across keyframes.

engagement_map

engagement_map.high_engagement_segments (array): Segments with predicted high viewer attention. Each has start_s, end_s, start_ms, end_ms, reason.
engagement_map.low_engagement_risk_segments (array): Segments where viewers are likely to skip. Each has start_s, end_s, start_ms, end_ms, reason. Use for ad placement.
engagement_map.hook_moments (array): Individual high-impact moments. Each has timestamp_s, timestamp_ms, reason.
engagement_map.payoff_moments (array): Key revelations or satisfying conclusions. Each has timestamp_s, timestamp_ms, reason.
engagement_map.cta_moments (array): Call-to-action moments. Each has timestamp_s, timestamp_ms, text, type.

on_screen_text

on_screen_text (array): All text detected on screen across keyframes, deduplicated and timestamped.
on_screen_text[].timestamp_s (float): Timestamp in seconds.
on_screen_text[].timestamp_ms (integer): Timestamp in milliseconds.
on_screen_text[].text (string): Exact text as it appears on screen.
on_screen_text[].type (string): "title_card" | "url" | "graphic" | "chyron" | "slide_text" | "other".

keyframes (Pass 1 raw data)

keyframes (array): Per-frame analysis from Pass 1. One entry per extracted keyframe.
keyframes[].timestamp_s (float): Frame timestamp in seconds.
keyframes[].timestamp_ms (integer): Frame timestamp in milliseconds.
keyframes[].description (string): Detailed natural-language description of what is happening in frame.
keyframes[].setting (string): "home_office" | "studio" | "outdoors" | "office" | "classroom" | "other".
keyframes[].presenter_emotion (string): "enthusiastic" | "calm" | "serious" | "humorous" | "concerned" | "neutral".
keyframes[].body_language (string): Description of posture, gesture, eye contact.
keyframes[].engagement_level (string): "high" | "medium" | "low".
keyframes[].on_screen_text (string[]): Text visible in this specific frame.
keyframes[].objects (string[]): Significant objects visible.
keyframes[].brands_visible (string[]): Brand names, logos, or products identified.
keyframes[].production_markers (string[]): "lower_thirds" | "chyron" | "broll" | "screen_recording" | "slide" | "graphic_overlay".
keyframes[].lighting_quality (string): "excellent" | "good" | "fair" | "poor".
keyframes[].framing_quality (string): "excellent" | "good" | "fair" | "poor".
keyframes[].face_visible (boolean): A face is visible in this frame.
keyframes[].face_centered (boolean): Face is roughly centred — safe for vertical crop.
keyframes[].thumbnail_score (float): 0–1 score for thumbnail suitability. High face visibility + good lighting + engaged expression = high score.

content_classification

iab_categories (array): IAB Content Taxonomy 3.0 categories. Each has id, name, confidence.
iab_categories[].id (string): IAB taxonomy ID, e.g. "IAB19" or "IAB19-18".
iab_categories[].name (string): Human-readable name, e.g. "Technology & Computing".
iab_categories[].confidence (float): Classification confidence 0–1.
iab_primary (string): Top-level IAB category ID, e.g. "IAB19". Use for DSP/ad-platform category declaration.
tags (string[]): Free-form discovery tags for platform search/recommendation.
keywords_ranked (array): Top keywords with relevance scores. Each has keyword, relevance.
primary_language (string): Language code, e.g. "en".
contains_music (boolean): Music detected in audio.
contains_speech (boolean): Speech detected in audio.
speech_clarity (string): "high" | "medium" | "low".
accessibility_score (float): 0–1 score based on speech clarity and caption availability.
content_rating (string): "general" | "teen" | "mature" | "adult".
suitable_for_children (boolean): Content safe for children.
suitable_for_families (boolean): Content safe for family viewing.

garm — GARM Brand Safety Framework

garm.overall_floor (string): "floor_1" | "floor_2" | "floor_3" | "floor_4". floor_1 = most permissive.
garm.suitable_for_ads (boolean): true if the video passes floor_1 (minimum acceptable standard).
garm.brand_safety_score (integer): Composite 0–100 score. 96+ = broadly brand safe.
garm.categories.adult_explicit_sexual_content (object): present: boolean, floor: null | floor_1–4.
garm.categories.arms_ammunition (object): present: boolean, floor: null | floor_1–4.
garm.categories.crime_harmful_acts (object): present: boolean, floor: null | floor_1–4.
garm.categories.death_injury_military_conflict (object): present: boolean, floor: null | floor_1–4.
garm.categories.online_piracy (object): present: boolean, floor: null | floor_1–4.
garm.categories.hate_speech_acts_of_aggression (object): present: boolean, floor: null | floor_1–4.
garm.categories.obscenity_profanity (object): present: boolean, floor: null | floor_1–4.
garm.categories.illegal_drugs_tobacco_vaping (object): present: boolean, floor: null | floor_1–4.
garm.categories.spam_harmful_content (object): present: boolean, floor: null | floor_1–4.
garm.categories.terrorism (object): present: boolean, floor: null | floor_1–4.
garm.categories.debated_sensitive_social_issues (object): present: boolean, floor: null | floor_1–4.

rights

rights.content_type (string): "original" | "licensed" | "ugc" | "derivative".
rights.contains_third_party_ip (boolean): Third-party intellectual property detected.
rights.detected_brands (string[]): Brand names or logos detected on screen.
rights.detected_music (string[]): Music detected — title/artist if identifiable.
rights.detected_stock_footage (boolean): Stock footage detected.
rights.copyright_flags (string[]): Content that may require rights clearance.

clip_suggestions

clip_suggestions (array): Short-form candidates ranked by virality potential.
clip_suggestions[].index (integer): Zero-based rank.
clip_suggestions[].start_s (float): Clip start in seconds.
clip_suggestions[].end_s (float): Clip end in seconds.
clip_suggestions[].start_ms (integer): Clip start in milliseconds.
clip_suggestions[].end_ms (integer): Clip end in milliseconds.
clip_suggestions[].duration_s (float): Clip duration in seconds.
clip_suggestions[].title (string): Suggested clip title.
clip_suggestions[].hook_text (string): Opening line / hook for this clip.
clip_suggestions[].reason (string): Why this segment was suggested.
clip_suggestions[].virality_score (float): 0–1 predicted virality score.
clip_suggestions[].short_form_suitable (boolean): Duration under 60s and self-contained.
clip_suggestions[].vertical_crop_viable (boolean): Face centred — safe to crop 16:9 to 9:16 for Reels/Shorts/TikTok.
clip_suggestions[].family_safe (boolean): Clip content is family-safe.

thumbnail_candidates

thumbnail_candidates (array): Frames ranked by thumbnail_score from Pass 1 keyframe analysis.
thumbnail_candidates[].rank (integer): 1-based rank (1 = best).
thumbnail_candidates[].timestamp_s (float): Frame timestamp in seconds.
thumbnail_candidates[].timestamp_ms (integer): Frame timestamp in milliseconds.
thumbnail_candidates[].score (float): Thumbnail quality score 0–1.
thumbnail_candidates[].reason (string): Why this frame scores well.
thumbnail_candidates[].frame_object_id (string): ZiB object ID of the extracted JPEG. Use getCdnUrl() to get the public URL.

Full ZibSidecar example

json
{
  "zib_sidecar_version": "1.0",
  "video_object_id": "zib_abc123",
  "generated_at": "2026-03-17T10:00:00Z",
  "processing_time_ms": 4821,
  "models_used": {
    "transcription": "recommended",
    "vision": "standard"
  },

  "video": {
    "duration_s": 612.4,
    "duration_ms": 612400,
    "width": 1920,
    "height": 1080,
    "aspect_ratio": "16:9",
    "fps": 29.97,
    "codec": "h264",
    "audio_channels": 2,
    "audio_sample_rate": 44100,
    "file_size_bytes": 284720384,
    "bitrate_kbps": 3720
  },

  "transcript": {
    "language": "en",
    "language_confidence": 0.98,
    "full_text": "Welcome back to the channel...",
    "word_count": 1842,
    "speaking_pace_wpm": 148,
    "silence_ratio": 0.12,
    "multiple_speakers": false,
    "segments": [
      {
        "index": 0,
        "start_s": 0.0,   "end_s": 4.2,
        "start_ms": 0,    "end_ms": 4200,
        "text": "Welcome back to the channel.",
        "speaker": "speaker_1",
        "confidence": 0.97,
        "intent": "hook",
        "viewer_engagement_prediction": "high",
        "is_hook_moment": true,
        "is_payoff_moment": false,
        "cta_detected": false,
        "cta_text": null,
        "question_posed": null,
        "key_claim": null
      }
    ]
  },

  "subtitles": {
    "language": "en",
    "srt_object_id": "zib_abc123.en.srt",
    "vtt_object_id": "zib_abc123.en.vtt",
    "srt_inline": "1\n00:00:00,000 --> 00:00:04,200\nWelcome back to the channel.\n\n...",
    "vtt_inline": "WEBVTT\n\n00:00:00.000 --> 00:00:04.200\nWelcome back to the channel.\n\n..."
  },

  "chapters": [
    {
      "index": 0,
      "title": "Introduction",
      "summary": "Host explains privacy limitations of centralised cloud storage.",
      "start_s": 0.0,   "end_s": 142.0,
      "start_ms": 0,    "end_ms": 142000,
      "keywords": ["introduction", "decentralised storage", "privacy"],
      "dominant_intent": "hook",
      "engagement_level": "high"
    }
  ],
  "chapters_youtube_format": "0:00 Introduction\n2:22 How ZiB Works\n6:20 Getting Started",

  "scene_understanding": {
    "executive_summary": "A single presenter in a professional home office...",
    "title_suggestion": "ZiB Network: Private Decentralised Storage Explained",
    "description_suggestion": "Learn how ZiB Network solves the privacy problems of traditional cloud...",
    "content_format": "tutorial",
    "narrative_arc": "problem_solution",
    "target_audience": "Intermediate to advanced software developers.",
    "assumed_knowledge_level": "intermediate",
    "unique_value": "Hands-on technical explanation with concrete failure demonstrations.",
    "series_indicators": "References 'last week's video on encryption' at 2:14.",
    "guest_present": false,
    "monetisation_notes": "No sponsored segments. Strong CTA at 9:45.",

    "production": {
      "quality": "prosumer",
      "setting": "home_office",
      "cut_style": "moderate",
      "talking_head_ratio": 0.78,
      "has_screen_recording": true,
      "has_broll": false,
      "has_slides": false,
      "has_lower_thirds": true,
      "has_captions_burned_in": false,
      "has_graphic_overlays": true,
      "presenter_count": 1,
      "consistent_framing": true,
      "consistent_lighting": true
    },

    "engagement_map": {
      "high_engagement_segments": [
        { "start_s": 0.0, "end_s": 45.0, "start_ms": 0, "end_ms": 45000,
          "reason": "Strong hook, high presenter energy" }
      ],
      "low_engagement_risk_segments": [
        { "start_s": 95.0, "end_s": 142.0, "start_ms": 95000, "end_ms": 142000,
          "reason": "Slow transition, background context" }
      ],
      "hook_moments": [
        { "timestamp_s": 8.4, "timestamp_ms": 8400,
          "reason": "Provocative claim: most cloud storage can be read by the provider" }
      ],
      "payoff_moments": [
        { "timestamp_s": 187.0, "timestamp_ms": 187000,
          "reason": "Live demonstration of node failure recovery" }
      ],
      "cta_moments": [
        { "timestamp_s": 585.0, "timestamp_ms": 585000,
          "text": "Link to GitHub in description", "type": "link" }
      ]
    },

    "on_screen_text": [
      { "timestamp_s": 12.0, "timestamp_ms": 12000,
        "text": "How ZiB Storage Works", "type": "title_card" }
    ],

    "keyframes": [
      {
        "timestamp_s": 0.0, "timestamp_ms": 0,
        "description": "Person seated at standing desk, looking directly at camera.",
        "setting": "home_office",
        "presenter_emotion": "enthusiastic",
        "body_language": "Forward lean, open hand gestures, direct eye contact",
        "engagement_level": "high",
        "on_screen_text": [],
        "objects": ["person", "desk", "monitor", "microphone"],
        "brands_visible": [],
        "production_markers": [],
        "lighting_quality": "good",
        "framing_quality": "excellent",
        "face_visible": true,
        "face_centered": true,
        "thumbnail_score": 0.91
      }
    ]
  },

  "content_classification": {
    "iab_categories": [
      { "id": "IAB19", "name": "Technology & Computing", "confidence": 0.97 },
      { "id": "IAB19-18", "name": "Internet Technology", "confidence": 0.91 }
    ],
    "iab_primary": "IAB19",
    "tags": ["decentralised storage", "ZiB", "encryption", "privacy", "Web3"],
    "keywords_ranked": [
      { "keyword": "decentralised storage", "relevance": 0.96 }
    ],
    "primary_language": "en",
    "contains_music": false,
    "contains_speech": true,
    "speech_clarity": "high",
    "accessibility_score": 0.87,
    "content_rating": "general",
    "suitable_for_children": false,
    "suitable_for_families": false,
    "garm": {
      "overall_floor": "floor_1",
      "suitable_for_ads": true,
      "brand_safety_score": 96,
      "categories": {
        "adult_explicit_sexual_content": { "present": false, "floor": null },
        "arms_ammunition": { "present": false, "floor": null },
        "crime_harmful_acts": { "present": false, "floor": null },
        "death_injury_military_conflict": { "present": false, "floor": null },
        "online_piracy": { "present": false, "floor": null },
        "hate_speech_acts_of_aggression": { "present": false, "floor": null },
        "obscenity_profanity": { "present": false, "floor": null },
        "illegal_drugs_tobacco_vaping": { "present": false, "floor": null },
        "spam_harmful_content": { "present": false, "floor": null },
        "terrorism": { "present": false, "floor": null },
        "debated_sensitive_social_issues": { "present": false, "floor": null }
      }
    },
    "rights": {
      "content_type": "original",
      "contains_third_party_ip": false,
      "detected_brands": [],
      "detected_music": [],
      "detected_stock_footage": false,
      "copyright_flags": []
    }
  },

  "clip_suggestions": [
    {
      "index": 0,
      "start_s": 8.4,   "end_s": 55.2,
      "start_ms": 8400, "end_ms": 55200,
      "duration_s": 46.8,
      "title": "What if your cloud storage provider can read everything you upload?",
      "hook_text": "Most people don't realise that every file you upload to S3 can be read by Amazon.",
      "reason": "Opens with provocative hook, delivers clear explanation, natural conclusion.",
      "virality_score": 0.81,
      "short_form_suitable": true,
      "vertical_crop_viable": true,
      "family_safe": true
    }
  ],

  "thumbnail_candidates": [
    {
      "rank": 1,
      "timestamp_s": 12.4, "timestamp_ms": 12400,
      "score": 0.91,
      "reason": "Clear face, direct eye contact, forward lean, good lighting.",
      "frame_object_id": "zib_abc123.thumb.0001.jpg"
    }
  ]
}

All Timestamps Have Two Formats

Every timed field in the ZibSidecar appears in both _s (float, seconds) and _ms (integer, milliseconds). No consumer ever needs to convert units. Use whichever format matches your platform — video players typically want seconds, ad servers and caption systems typically want milliseconds.
javascript
// Both are always present — pick the unit that matches your system
const segment = sidecar.transcript.segments[0];
segment.start_s   // => 0.0        (float seconds — for video players)
segment.start_ms  // => 0          (integer ms — for ad servers, caption pipelines)

const chapter = sidecar.chapters[0];
chapter.start_s   // => 142.0
chapter.start_ms  // => 142000

const clip = sidecar.clip_suggestions[0];
clip.start_ms     // => 8400       (integer — no floating point rounding to worry about)

const thumb = sidecar.thumbnail_candidates[0];
thumb.timestamp_ms // => 12400     (seek video player to this ms)

Consumer Guide

How downstream platforms map from the ZibSidecar to their own ingest formats. The sidecar is designed to be rich enough that any consumer can extract what they need without a translation layer in ZiB.

RiteStream / Video platforms

javascript
const { scene_understanding, content_classification, chapters } = sidecar;

// Auto-populate upload form
form.title       = scene_understanding.title_suggestion;
form.description = scene_understanding.description_suggestion;
form.tags        = content_classification.tags;

// Paste directly into YouTube description field
form.chapters    = sidecar.chapters_youtube_format;

// Ad placement — brand safety gate
const brandSafe = content_classification.garm.brand_safety_score > 80;
const adFloor   = content_classification.garm.overall_floor; // "floor_1"

// Place mid-roll ad at a natural viewing lull
const adMidrollAt = sidecar.scene_understanding.engagement_map
  .low_engagement_risk_segments[0]?.start_ms;

// Search indexing
const searchDoc = {
  id:          fileId,
  title:       scene_understanding.title_suggestion,
  body:        scene_understanding.executive_summary,
  transcript:  sidecar.transcript.full_text,
  tags:        content_classification.tags,
  iab:         content_classification.iab_primary,
};

Content licensing / AllRites

javascript
// Caption delivery — segments have start_ms and end_ms ready-to-use
const captions = sidecar.transcript.segments.map(s => ({
  start: s.start_ms,
  end:   s.end_ms,
  text:  s.text,
}));

// SRT available inline — no second fetch needed
const srtContent = sidecar.subtitles.srt_inline;
const vttContent = sidecar.subtitles.vtt_inline;

// Content moderation and ad eligibility
const adEligible  = sidecar.content_classification.garm.suitable_for_ads;
const iabCategory = sidecar.content_classification.iab_primary; // e.g. "IAB19"

// Rights and IP detection
const { rights } = sidecar.content_classification;
if (rights.contains_third_party_ip || rights.copyright_flags.length > 0) {
  flagForRightsReview(fileId, rights);
}

Short-form / Clips (Reels, Shorts, TikTok)

javascript
// Find clips suitable for Reels/Shorts
const shortFormClips = sidecar.clip_suggestions.filter(c => c.short_form_suitable);

// Further filter to those safe for vertical crop (9:16)
const verticalClips = shortFormClips.filter(c => c.vertical_crop_viable);

// Each clip has everything needed to create a derivative
verticalClips.forEach(clip => {
  console.log(clip.start_ms, clip.end_ms); // trim points in ms
  console.log(clip.title);                 // suggested title
  console.log(clip.hook_text);             // opening caption
  console.log(clip.virality_score);        // 0-1 predicted virality
});

Professional NLE (Premiere, DaVinci, Final Cut)

javascript
// Import sequence markers from chapters
const markers = sidecar.chapters.map(ch => ({
  name: ch.title,
  in:   ch.start_ms,
  out:  ch.end_ms,
}));

// Caption track import — SRT inline, no extra fetch
const captionTrack = sidecar.subtitles.srt_inline;

// Best poster frame — frame_object_id is a ZiB object, get CDN URL
const bestThumb = sidecar.thumbnail_candidates[0];
const thumbUrl  = zib.getCdnUrl(bestThumb.frame_object_id);

// On-screen text for auto-generated lower thirds
sidecar.scene_understanding.on_screen_text.forEach(item => {
  if (item.type === 'chyron' || item.type === 'title_card') {
    addLowerThird(item.timestamp_ms, item.text);
  }
});

Programmatic ad platforms / DSPs

javascript
const { content_classification: cc } = sidecar;

// IAB Content Taxonomy 3.0 — readable by all major DSPs without translation
const targeting = {
  iab_categories: cc.iab_categories.map(c => c.id), // ["IAB19", "IAB19-18"]
  iab_primary:    cc.iab_primary,
};

// GARM brand safety — same standard used by Snapchat, YouTube, all DSPs
const brandSafety = {
  floor:          cc.garm.overall_floor,  // "floor_1"
  score:          cc.garm.brand_safety_score,
  suitable:       cc.garm.suitable_for_ads,
  categories:     cc.garm.categories,    // 11 GARM categories, each with floor
};

// Engagement-driven ad slot selection
const adSlots = sidecar.scene_understanding.engagement_map
  .low_engagement_risk_segments
  .map(seg => ({ in_ms: seg.start_ms, out_ms: seg.end_ms, reason: seg.reason }));

Custom Output Schema

Partners who need a specific field mapping can provide a JSON template with {{dot.notation.paths}} placeholders. The backend substitutes values from the ZibSidecar before returning, so partners define their own ingest format as config without any code changes in ZiB.

http
POST /v1/api/compute/{objectId}
Authorization: Bearer ACCESS_KEY:SECRET_KEY
Content-Type: application/json

{
  "vision": "standard",
  "output_schema": {
    "my_title":    "{{scene_understanding.title_suggestion}}",
    "duration_ms": "{{video.duration_ms}}",
    "brand_safe":  "{{content_classification.garm.suitable_for_ads}}",
    "top_tag":     "{{content_classification.tags.0}}",
    "floor":       "{{content_classification.garm.overall_floor}}",
    "iab":         "{{content_classification.iab_primary}}"
  }
}

The response will be the template with placeholders substituted from the ZibSidecar — not the full sidecar. Array elements are accessed with dot-index notation: {{content_classification.tags.0}} returns the first tag.
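With the request above, the substituted response would look something like this (values taken from the full sidecar example earlier on this page; illustrative only):

```json
{
  "my_title": "ZiB Network: Private Decentralised Storage Explained",
  "duration_ms": 612400,
  "brand_safe": true,
  "top_tag": "decentralised storage",
  "floor": "floor_1",
  "iab": "IAB19"
}
```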

When no output_schema is provided, the full ZibSidecar is returned. Use output_schema only when your downstream ingest system expects a specific shape and you want ZiB to do the mapping at compute time.

IAB Content Taxonomy 3.0

ZiB uses the IAB Content Taxonomy 3.0. Category IDs like IAB19 and IAB19-18 are the actual standard used by all major ad platforms, SSPs, and DSPs. The sidecar outputs real taxonomy IDs — no translation layer is needed for DSP integration. The full taxonomy is maintained at github.com/InteractiveAdvertisingBureau/Taxonomies.

The iab_primary field is the top-level category (e.g. IAB19). The iab_categories array includes subcategory IDs and confidence scores. All categories with confidence above 0.6 are included. Declare the iab_primary value directly in your DSP or ad platform content category field — no mapping required.
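A consumer that wants a stricter cut than the 0.6 inclusion threshold can filter by confidence itself (a small sketch using the iab_categories shape above):

```javascript
// Sketch: keep only high-confidence IAB category IDs.
function highConfidenceIds(iabCategories, min = 0.95) {
  return iabCategories.filter(c => c.confidence >= min).map(c => c.id);
}

const cats = [
  { id: 'IAB19',    name: 'Technology & Computing', confidence: 0.97 },
  { id: 'IAB19-18', name: 'Internet Technology',    confidence: 0.91 },
];
console.log(highConfidenceIds(cats)); // [ 'IAB19' ]
```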

GARM Brand Safety

The 11 GARM categories and floor_1–floor_4 floor levels in the sidecar are the GARM Brand Safety Floor & Suitability Framework, used by Snapchat, YouTube, and all major DSPs. The framework is maintained at wfanet.org.

Floor levels

Floor   | Meaning            | Typical use
floor_1 | Minimum acceptable | Most brand advertising. suitable_for_ads: true
floor_2 | Low suitability    | Restricted categories only (alcohol, gambling)
floor_3 | High risk          | Not suitable for standard brand campaigns
floor_4 | Unsafe             | Block entirely. suitable_for_ads: false

brand_safety_score is a composite 0–100 score. A score of 96+ means the content is broadly brand safe across all 11 categories. suitable_for_ads: true means the video passes floor_1 — the industry minimum standard for advertising suitability. Each of the 11 categories also carries an individual floor level, so ad platforms can apply their own category-level exclusion rules independently of the overall score.
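The per-category floors make advertiser-specific exclusion lists straightforward. A sketch of gating on both the overall flag and a custom category list (category keys as documented in the garm block above):

```javascript
// Sketch: advertiser-specific brand-safety gate layered on top of
// the sidecar's overall suitable_for_ads flag.
function passesExclusions(garm, excludedCategories) {
  if (!garm.suitable_for_ads) return false;
  // Reject if any excluded GARM category is present in the content.
  return excludedCategories.every(
    cat => !(garm.categories[cat] && garm.categories[cat].present)
  );
}

const garm = {
  suitable_for_ads: true,
  categories: {
    arms_ammunition:     { present: false, floor: null },
    obscenity_profanity: { present: true,  floor: 'floor_2' },
  },
};
console.log(passesExclusions(garm, ['arms_ammunition']));     // true
console.log(passesExclusions(garm, ['obscenity_profanity'])); // false
```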