ZiB Network

Vision AI

ZiB Vision AI performs deep scene analysis on your video files. The output is a ZibSidecar JSON (zib_sidecar_version: "2.0") — a single structured document with ten top-level blocks: video, transcript, subtitles, chapters, scene_understanding, content_classification, clip_suggestions, thumbnail_candidates, and the v2.0 commercial-intelligence blocks ad_targeting and ai_retrieval.

Schema v2.0 adds an inferential pass on top of the observational analysis: brand-safe ad targeting (ad_targeting), IAB Content Taxonomy 3.0 with tiers and descriptive vectors (content_classification.iab_v3), and viewer intent / audience signals (scene_understanding.intent and .audience). Branch on zib_sidecar_version if you also consume older v1.0 sidecars.
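As a minimal sketch of that version branch, the helper below returns the inferential blocks only when the sidecar declares a 2.x schema. The function name readInferentialSignals and the shape of its return value are illustrative assumptions, not part of the ZiB SDK; only the field names come from this page.

```javascript
// Hypothetical helper: collect the v2.0 inferential blocks, or nulls for v1.0.
// Assumes zib_sidecar_version is a "major.minor" string such as "1.0" or "2.0".
function readInferentialSignals(sidecar) {
  const major = Number.parseInt(sidecar.zib_sidecar_version, 10);
  if (!(major >= 2)) {
    // v1.0 (or unknown) sidecars carry only the observational analysis.
    return { adTargeting: null, intent: null, audience: null };
  }
  return {
    adTargeting: sidecar.ad_targeting ?? null,
    intent: sidecar.scene_understanding?.intent ?? null,
    audience: sidecar.scene_understanding?.audience ?? null,
  };
}
```

Returning explicit nulls for older sidecars lets downstream code use one code path and check for null, rather than repeating the version test.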

The ZibSidecar is encrypted and stored redundantly on the ZiB network alongside your video. Retrieve it via CDN URL using the sidecar_file_id from the job status response — it is a standard ZiB object, decrypted on-the-fly by the CDN.

Schema: v2.0 (observational + inferential)
Tiers: 2 (standard · hq)
Standards: IAB + GARM (industry-standard)
Format: Encrypted JSON (on ZiB network)
The compute node decrypts the video in-memory only for the duration of inference. The AES key is never written to disk. The ZibSidecar output is encrypted with a new key before being stored. Node operators cannot access the video content or the AI analysis.

Tiers

Tier | Speed | Quality
standard (default) | Default | Good: scene description, tags, summary, clip suggestions
hq | Slower | High: richer detail, better accuracy on complex visual content

The standard tier is recommended for the majority of use cases. Use hq when richer scene descriptions and higher accuracy on complex visual content justify the additional cost.

Quick Start

Submit a Vision AI job for a file already uploaded to ZiB, poll until complete, then fetch the sidecar as plain JSON.

javascript
// Submit vision AI job
const { job_id } = await zib.submitVisionAI(objectId, 'standard');

// Poll until complete (typically 1–5 min depending on video length)
let status;
do {
  await new Promise(r => setTimeout(r, 3000));
  status = await zib.getComputeStatus(job_id);
} while (status.status !== 'complete' && status.status !== 'failed');

if (status.status === 'failed') {
  throw new Error(status.error_message);
}

// Fetch the sidecar — standard ZiB CDN URL, no special auth needed
const sidecarUrl = zib.getCdnUrl(status.sidecar_file_id);
const sidecar = await fetch(sidecarUrl).then(r => r.json());
console.log(sidecar.scene_understanding.title_suggestion);
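The polling loop above runs until the job reaches a terminal state. For production use, a bounded variant may be preferable; the sketch below wraps the same getComputeStatus call with a deadline. The helper name waitForJob and the intervalMs / timeoutMs defaults are illustrative assumptions, not documented limits.

```javascript
// Hypothetical helper: poll a compute job until it completes, fails, or times out.
// `zib` is the client used in the Quick Start; the defaults are illustrative only.
async function waitForJob(zib, jobId, { intervalMs = 3000, timeoutMs = 10 * 60 * 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await zib.getComputeStatus(jobId);
    if (status.status === 'complete') return status;
    if (status.status === 'failed') {
      throw new Error(status.error_message ?? 'Vision AI job failed');
    }
    // Any other status is treated as "still running": wait and try again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out waiting for job ${jobId}`);
}
```

With this helper, the Quick Start reduces to submitting the job, awaiting waitForJob(zib, job_id), and fetching the sidecar from the returned sidecar_file_id.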

Raw HTTP

http
POST /v1/api/compute/{objectId}
Authorization: Bearer ACCESS_KEY:SECRET_KEY
Content-Type: application/json

{
  "vision": "standard"
}
The sidecar is fetched like any other ZiB file — via getCdnUrl(sidecar_file_id). The CDN decrypts on-the-fly and returns plain JSON. No special headers or authentication required.
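For callers not using the SDK, the raw request above can be assembled for fetch() roughly as follows. This is a sketch under assumptions: buildVisionRequest is a hypothetical helper, and baseUrl is a placeholder because the API host is not stated in this section.

```javascript
// Hypothetical helper: build the URL and fetch() options for the raw HTTP call.
// baseUrl is a placeholder; the real API host is not given in this section.
function buildVisionRequest(baseUrl, objectId, tier, accessKey, secretKey) {
  if (tier !== 'standard' && tier !== 'hq') {
    throw new Error(`Unknown tier: ${tier}`);
  }
  return {
    url: `${baseUrl}/v1/api/compute/${encodeURIComponent(objectId)}`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessKey}:${secretKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ vision: tier }),
    },
  };
}

// Usage (needs network access and real credentials). The response is assumed
// to carry job_id, as the SDK call in the Quick Start does:
// const { url, init } = buildVisionRequest(API_BASE, objectId, 'standard', ACCESS_KEY, SECRET_KEY);
// const { job_id } = await fetch(url, init).then((r) => r.json());
```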

Analysis Pipeline

Vision AI runs a seven-pass Qwen2.5-VL pipeline. Passes 1–3b are observational (what the model sees); passes 4a–4c are the v2.0 inferential layer (what the model concludes), split into three focused calls so that each has its own token budget and the output is not truncated mid-JSON.

01. Per-Keyframe Deep Analysis (per extracted keyframe)

Each keyframe is analysed independently. Produces: description, setting, presenter_emotion, body_language, engagement_level, on_screen_text, objects, brands_visible, production_markers, lighting_quality, framing_quality, face_visible, face_centered, thumbnail_score (0–1). Raw material for the engagement map and thumbnail candidates.

02. Per-Segment Intent Analysis (per transcript segment)

Each transcript segment is paired with its nearest keyframe. Produces: intent (hook / payoff / call_to_action / explaining / demonstrating / etc.), viewer_engagement_prediction, is_hook_moment, is_payoff_moment, cta_detected, cta_text, question_posed, key_claim. Merged into each transcript.segments[] entry.

3a. Narrative Synthesis (once, after passes 1–2)

Whole-video understanding from all keyframes + the full transcript: executive_summary (150–200 words), title_suggestion, description_suggestion, content_format, narrative_arc, target_audience, assumed_knowledge_level, unique_value, series_indicators, monetisation_notes, and the production analysis block.

3b. Chapters, Engagement & Clips (once, after 3a)

Synthesises chapters + chapters_youtube_format, the engagement_map (high/low engagement, hook/payoff/cta moments), clip_suggestions ranked by virality_score, and thumbnail_candidates ranked from Pass 1 thumbnail scores.

4a. Ad Targeting (v2.0 inferential)

Populates the top-level ad_targeting block: contextual_keywords, brand_affinity, competitor_exclusions, ad_moment_candidates (timestamped, scored), cpm_floor_suggestion (floor_usd + confidence + rationale), suitable_for_ads, recommended_ad_categories, monetisation_notes.

4b. Brand Safety & Classification (v2.0 inferential)

Populates content_classification.garm (GARM floor + brand_safety_score + 11 categories), content_classification.iab_v3 (IAB Content Taxonomy 3.0 categories with unique_id/label/tier/confidence, primary_category, descriptive_vectors), and content signals (contains_music, contains_speech, content_rating, suitable_for_children/families).

4c. Intent & Audience (v2.0 inferential)

Populates scene_understanding.intent (viewer_intent, content_intent, call_to_action, emotional_tone, controversy_score) and scene_understanding.audience (description, psychographic_tags, age_skew, gender_skew, affluence_signal, knowledge_level, viewing_context).

ZibSidecar JSON Reference

The complete schema. Every field present in the canonical sidecar is documented below, grouped by top-level section. All timestamp fields appear in both _s (float seconds) and _ms (integer milliseconds) — no consumer ever needs to convert.

Identity & metadata

zib_sidecar_version (string): Schema version. Current pipeline emits "2.0". Older sidecars may be "1.0" (no inferential blocks).
video_object_id (string): ZiB object ID (file_id) of the source video.
generated_at (string): ISO 8601 timestamp of when analysis completed.
processing_time_ms (integer): Total inference time in milliseconds.
models_used.transcription (string | null): Tier used for transcription, e.g. "recommended". null if transcription was not requested.
models_used.vision (string | null): Tier used for vision, e.g. "standard" or "hq".
models_used.inference (string | null): Model used for the v2.0 inferential passes (4a–4c), when present.

video — technical properties

video.duration_s (float): Duration in seconds.
video.duration_ms (integer): Duration in milliseconds.
video.width (integer): Video width in pixels.
video.height (integer): Video height in pixels.
video.aspect_ratio (string): Normalised ratio string, e.g. "16:9".
video.fps (float): Frame rate, e.g. 29.97.
video.codec (string): Video codec, e.g. "h264".
video.audio_channels (integer): Number of audio channels (1 = mono, 2 = stereo).
video.audio_sample_rate (integer): Audio sample rate in Hz, e.g. 44100.
video.file_size_bytes (integer): Source file size in bytes.
video.bitrate_kbps (integer): Overall bitrate in kbps.

transcript — speech-to-text output

The full speech-to-text output. Every segment entry also contains the intent fields merged in-place.

transcript.language (string): Detected language code, e.g. "en".
transcript.language_confidence (float): Language detection confidence 0–1.
transcript.full_text (string): Complete transcript as a single string.
transcript.word_count (integer): Total word count.
transcript.speaking_pace_wpm (integer): Words per minute of spoken content.
transcript.silence_ratio (float): Proportion of video with no speech (0–1).
transcript.multiple_speakers (boolean): Whether more than one speaker was detected.
transcript.segments (array): Timestamped segments. Each entry below:
segments[].index (integer): Zero-based segment index.
segments[].start_s (float): Segment start in seconds.
segments[].end_s (float): Segment end in seconds.
segments[].start_ms (integer): Segment start in milliseconds.
segments[].end_ms (integer): Segment end in milliseconds.
segments[].text (string): Transcript text for this segment.
segments[].speaker (string): "speaker_1", "speaker_2" etc. when multiple speakers detected.
segments[].confidence (float): Transcription confidence score 0–1.
segments[].intent (string): Pass 2: e.g. "hook", "explaining", "call_to_action", "payoff".
segments[].viewer_engagement_prediction (string): Pass 2: "high" | "medium" | "low" | "skip_risk".
segments[].is_hook_moment (boolean): Pass 2: Would this segment work as a short-form clip opener?
segments[].is_payoff_moment (boolean): Pass 2: Is this a key revelation or satisfying conclusion?
segments[].cta_detected (boolean): Pass 2: Call-to-action present in this segment.
segments[].cta_text (string | null): Pass 2: CTA text if detected, null otherwise.
segments[].question_posed (string | null): Pass 2: Question asked by presenter, null if none.
segments[].key_claim (string | null): Pass 2: Most important claim in this segment, null if none.
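Because the Pass 2 fields are merged into each segment, a single pass over transcript.segments can pull out hooks, calls to action, and skip-risk counts. The sketch below assumes only the field names listed above; summariseSegments and its return shape are hypothetical.

```javascript
// Hypothetical helper: summarise the Pass 2 intent fields of a transcript.
function summariseSegments(transcript) {
  const segments = transcript?.segments ?? [];
  return {
    // Segments flagged as usable short-form openers.
    hooks: segments
      .filter((s) => s.is_hook_moment)
      .map((s) => ({ start_ms: s.start_ms, text: s.text })),
    // Segments with a detected call to action and non-null CTA text.
    ctas: segments
      .filter((s) => s.cta_detected && s.cta_text)
      .map((s) => ({ start_ms: s.start_ms, cta_text: s.cta_text })),
    // Number of segments where viewers are predicted to skip.
    skipRiskCount: segments.filter((s) => s.viewer_engagement_prediction === 'skip_risk').length,
  };
}
```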

subtitles

subtitles.language (string): Language code, e.g. "en".
subtitles.srt_object_id (string): ZiB object ID of the stored SRT file.
subtitles.vtt_object_id (string): ZiB object ID of the stored WebVTT file.
subtitles.srt_inline (string): Complete SRT content as a string, ready to use without a separate fetch.
subtitles.vtt_inline (string): Complete WebVTT content as a string, ready to use without a separate fetch.

chapters — synthesised chapter markers

chapters (array): Synthesised chapter breaks from topic shifts and keyframe analysis.
chapters[].index (integer): Zero-based chapter index.
chapters[].title (string): AI-generated chapter title.
chapters[].summary (string): One-sentence description of chapter content.
chapters[].start_s (float): Chapter start in seconds.
chapters[].end_s (float): Chapter end in seconds.
chapters[].start_ms (integer): Chapter start in milliseconds.
chapters[].end_ms (integer): Chapter end in milliseconds.
chapters[].keywords (string[]): Key topics in this chapter.
chapters[].dominant_intent (string): Most common segment intent, e.g. "hook", "explaining".
chapters[].engagement_level (string): "high" | "medium" | "low" based on Pass 2 engagement predictions.
chapters_youtube_format (string): Ready-to-paste YouTube description chapter list, e.g. "0:00 Introduction\n2:22 How ZiB Works".
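If chapter titles are edited after analysis, the YouTube-style list can be rebuilt from chapters[] instead of reusing chapters_youtube_format. The sketch below is one way to do that; formatTimestamp and toYouTubeChapters are hypothetical helpers, and the h:mm:ss form for videos over an hour is an assumption about the desired output rather than something this page specifies.

```javascript
// Format whole seconds as m:ss, or h:mm:ss once the value reaches one hour.
function formatTimestamp(totalSeconds) {
  const s = Math.floor(totalSeconds);
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = String(s % 60).padStart(2, '0');
  return hours > 0
    ? `${hours}:${String(minutes).padStart(2, '0')}:${seconds}`
    : `${minutes}:${seconds}`;
}

// Hypothetical helper: rebuild a "0:00 Title" list from chapters[], one per line.
function toYouTubeChapters(chapters) {
  return [...chapters]
    .sort((a, b) => a.start_ms - b.start_ms)
    .map((c) => `${formatTimestamp(c.start_s)} ${c.title}`)
    .join('\n');
}
```

A chapter starting at 142.0 s is rendered as "2:22", matching the format shown for chapters_youtube_format.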

scene_understanding — Pass 3 synthesis

The richest block — whole-video understanding from Pass 3, per-frame data from Pass 1, and the engagement map.

executive_summary (string): 150–200 word paragraph describing the video for search indexing and content discovery.
title_suggestion (string): SEO-optimised title, 60 chars max.
description_suggestion (string): 250–300 word platform description suitable for YouTube/RiteStream.
content_format (string): "tutorial" | "explainer" | "vlog" | "review" | "interview" | "podcast_clip" | "documentary" | "product_demo" | "news" | "entertainment" | "other".
narrative_arc (string): "problem_solution" | "question_answer" | "before_after" | "listicle" | "story" | "demonstration" | "none_apparent".
target_audience (string): Specific description of intended audience.
assumed_knowledge_level (string): "beginner" | "intermediate" | "advanced" | "expert".
unique_value (string): What makes this video worth watching: what the viewer gets.
series_indicators (string | null): Evidence this is part of a series, null if standalone.
guest_present (boolean): Whether a guest speaker/interviewee is present.
monetisation_notes (string): Observations relevant to ad placement, sponsored segments, competitor mentions.

production block

production.quality (string): "broadcast" | "prosumer" | "consumer" | "ugc".
production.setting (string): "home_office" | "studio" | "outdoors" | "office" | "classroom" | "other".
production.cut_style (string): "fast" | "moderate" | "slow" | "static".
production.talking_head_ratio (float): Proportion of video showing a presenter face (0–1).
production.has_screen_recording (boolean): Screen recording detected in at least one keyframe.
production.has_broll (boolean): B-roll footage detected.
production.has_slides (boolean): Presentation slides detected.
production.has_lower_thirds (boolean): Lower-third graphics detected.
production.has_captions_burned_in (boolean): Hardcoded captions detected on screen.
production.has_graphic_overlays (boolean): Graphic overlays detected.
production.presenter_count (integer): Number of distinct presenters detected.
production.consistent_framing (boolean): Camera framing consistent across keyframes.
production.consistent_lighting (boolean): Lighting consistent across keyframes.

engagement_map

engagement_map.high_engagement_segments (array): Segments with predicted high viewer attention. Each has start_s, end_s, start_ms, end_ms, reason.
engagement_map.low_engagement_risk_segments (array): Segments where viewers are likely to skip. Each has start_s, end_s, start_ms, end_ms, reason. Use for ad placement.
engagement_map.hook_moments (array): Individual high-impact moments. Each has timestamp_s, timestamp_ms, reason.
engagement_map.payoff_moments (array): Key revelations or satisfying conclusions. Each has timestamp_s, timestamp_ms, reason.
engagement_map.cta_moments (array): Call-to-action moments. Each has timestamp_s, timestamp_ms, text, type.

on_screen_text

on_screen_text (array): All text detected on screen across keyframes, deduplicated and timestamped.
on_screen_text[].timestamp_s (float): Timestamp in seconds.
on_screen_text[].timestamp_ms (integer): Timestamp in milliseconds.
on_screen_text[].text (string): Exact text as it appears on screen.
on_screen_text[].type (string): "title_card" | "url" | "graphic" | "chyron" | "slide_text" | "other".

keyframes (Pass 1 raw data)

keyframes (array): Per-frame analysis from Pass 1. One entry per extracted keyframe.
keyframes[].timestamp_s (float): Frame timestamp in seconds.
keyframes[].timestamp_ms (integer): Frame timestamp in milliseconds.
keyframes[].description (string): Detailed natural language description of what is happening in frame.
keyframes[].setting (string): "home_office" | "studio" | "outdoors" | "office" | "classroom" | "other".
keyframes[].presenter_emotion (string): "enthusiastic" | "calm" | "serious" | "humorous" | "concerned" | "neutral".
keyframes[].body_language (string): Description of posture, gesture, eye contact.
keyframes[].engagement_level (string): "high" | "medium" | "low".
keyframes[].on_screen_text (string[]): Text visible in this specific frame.
keyframes[].objects (string[]): Significant objects visible.
keyframes[].brands_visible (string[]): Brand names, logos, or products identified.
keyframes[].production_markers (string[]): "lower_thirds" | "chyron" | "broll" | "screen_recording" | "slide" | "graphic_overlay".
keyframes[].lighting_quality (string): "excellent" | "good" | "fair" | "poor".
keyframes[].framing_quality (string): "excellent" | "good" | "fair" | "poor".
keyframes[].face_visible (boolean): A face is visible in this frame.
keyframes[].face_centered (boolean): Face is roughly centred, so the frame is safe for vertical crop.
keyframes[].thumbnail_score (float): 0–1 score for thumbnail suitability. High face visibility + good lighting + engaged expression = high score.

intent — Pass 4c (v2.0 inferential)

intent.viewer_intent (string): Why a viewer watches: "research" | "entertainment" | "purchase_decision" | "education" | "news".
intent.content_intent (string): What the creator made: "documentary" | "product_demo" | "interview" | "tutorial" | "vlog" | "news" | "debate" | "other".
intent.call_to_action (string): Primary CTA: "subscribe" | "visit_site" | "purchase" | "learn_more" | "none".
intent.emotional_tone (string): "inspiring" | "alarming" | "humorous" | "melancholic" | "neutral" | "tense".
intent.controversy_score (float): 0.0–1.0. Higher = more likely to provoke polarised reactions.
intent.controversy_reason (string | null): Why the controversy score was assigned, null if low.

audience — Pass 4c (v2.0 inferential)

audience.description (string): Natural-language description of the likely audience.
audience.psychographic_tags (string[]): Interest/values tags, e.g. ["early_adopter", "privacy_conscious"].
audience.age_skew (string): "13-17" | "18-24" | "25-34" | "25-44" | "35-54" | "45-64" | "55+" | "broad".
audience.gender_skew (string): "neutral" | "skews_male" | "skews_female".
audience.affluence_signal (string): "low" | "mid" | "high".
audience.knowledge_level (string): Assumed domain knowledge of the target viewer.
audience.viewing_context (string): "lean_back" | "active_research" | "mobile_commute" | "background".

topic_segments — Pass 4c (v2.0 inferential)

topic_segments (array): Time-ranged topic spans for mid-roll placement and chaptered search.
topic_segments[].start_s / end_s (float): Span start/end in seconds (also _ms variants).
topic_segments[].topic (string): Short topic label for the span.
topic_segments[].summary (string): One-line summary of the span.
topic_segments[].keywords (string[]): Keywords for this span.
topic_segments[].ad_suitable (boolean): Whether this span is a good place to insert an ad break.

content_classification

iab_categories (array): Legacy v1 IAB array (id, name, confidence). Retained for back-compat; prefer content_classification.iab_v3 below for new integrations.
iab_categories[].id (string): IAB taxonomy ID, e.g. "IAB19" or "IAB19-18".
iab_categories[].name (string): Human name, e.g. "Technology & Computing".
iab_categories[].confidence (float): Classification confidence 0–1.
iab_primary (string): Legacy v1 top-level IAB category ID, e.g. "IAB19". See iab_v3.primary_category for the v2.0 equivalent.
tags (string[]): Free-form discovery tags for platform search/recommendation.
keywords_ranked (array): Top keywords with relevance scores. Each has keyword, relevance.
primary_language (string): Language code, e.g. "en".
contains_music (boolean): Music detected in audio.
contains_speech (boolean): Speech detected in audio.
speech_clarity (string): "high" | "medium" | "low".
accessibility_score (float): 0–1 score based on speech clarity and caption availability.
content_rating (string): "general" | "teen" | "mature" | "adult".
suitable_for_children (boolean): Content safe for children.
suitable_for_families (boolean): Content safe for family viewing.

garm — GARM Brand Safety Framework

garm.overall_floor (string): "floor_1" | "floor_2" | "floor_3" | "floor_4". floor_1 = most permissive.
garm.suitable_for_ads (boolean): true if video passes floor_1 (minimum acceptable standard).
garm.brand_safety_score (integer): Composite 0–100 score. 96+ = broadly brand safe.
garm.categories.adult_explicit_sexual_content (object): present: boolean, floor: null | floor_1–4.
garm.categories.arms_ammunition (object): present: boolean, floor: null | floor_1–4.
garm.categories.crime_harmful_acts (object): present: boolean, floor: null | floor_1–4.
garm.categories.death_injury_military_conflict (object): present: boolean, floor: null | floor_1–4.
garm.categories.online_piracy (object): present: boolean, floor: null | floor_1–4.
garm.categories.hate_speech_acts_of_aggression (object): present: boolean, floor: null | floor_1–4.
garm.categories.obscenity_profanity (object): present: boolean, floor: null | floor_1–4.
garm.categories.illegal_drugs_tobacco_vaping (object): present: boolean, floor: null | floor_1–4.
garm.categories.spam_harmful_content (object): present: boolean, floor: null | floor_1–4.
garm.categories.terrorism (object): present: boolean, floor: null | floor_1–4.
garm.categories.debated_sensitive_social_issues (object): present: boolean, floor: null | floor_1–4.
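One way to consume the garm block is a small gate that combines suitable_for_ads with a score threshold and lists any categories marked present. The sketch below is illustrative: brandSafetyReport is a hypothetical helper, and the default threshold of 96 simply reuses the "96+ = broadly brand safe" guidance above. Individual advertisers may want a different cut-off.

```javascript
// Hypothetical helper: summarise the GARM block for an ad-serving decision.
// Returns safe: null when the block is missing (e.g. a v1.0 sidecar).
function brandSafetyReport(sidecar, minScore = 96) {
  const garm = sidecar.content_classification?.garm;
  if (!garm) return { safe: null, floor: null, flagged: [] };

  // Collect every category the classifier marked as present.
  const flagged = Object.entries(garm.categories ?? {})
    .filter(([, value]) => value.present)
    .map(([name, value]) => ({ name, floor: value.floor }));

  return {
    safe: garm.suitable_for_ads === true && garm.brand_safety_score >= minScore,
    floor: garm.overall_floor,
    flagged,
  };
}
```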

rights

rights.content_type (string): "original" | "licensed" | "ugc" | "derivative".
rights.contains_third_party_ip (boolean): Third-party intellectual property detected.
rights.detected_brands (string[]): Brand names or logos detected on screen.
rights.detected_music (string[]): Music detected, with title/artist if identifiable.
rights.detected_stock_footage (boolean): Stock footage detected.
rights.copyright_flags (string[]): Content that may require rights clearance.

iab_v3 — IAB Content Taxonomy 3.0 (v2.0 inferential)

iab_v3.version (string): Taxonomy version, "3.0".
iab_v3.categories (array): All matched categories. Each entry below:
categories[].unique_id (string): IAB v3 Unique ID, e.g. "596" or "JLBCU7".
categories[].label (string): Human label, e.g. "Technology & Computing".
categories[].tier (integer): Taxonomy tier depth (1 = top level).
categories[].confidence (float): Classification confidence 0–1.
iab_v3.primary_category (object | null): Single best category (same shape as categories[]). Declare directly to a DSP.
iab_v3.descriptive_vectors (object): IAB Audience/Content descriptive vectors:
descriptive_vectors.content_type (string | null): e.g. "video".
descriptive_vectors.content_channel (string | null): Distribution channel, e.g. "streaming".
descriptive_vectors.data_source (string | null): Provenance of the classification.
descriptive_vectors.genre (string | null): Genre descriptor.
descriptive_vectors.language (string | null): Primary language code, e.g. "en".
descriptive_vectors.live (boolean | null): Whether the content is live.
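Integrations that read both schema versions can prefer iab_v3.primary_category and fall back to the legacy iab_primary ID. The following is a sketch of that lookup; primaryIabCategory and its return shape are hypothetical, and only the field names are taken from the reference above.

```javascript
// Hypothetical helper: resolve the primary IAB category across schema versions.
function primaryIabCategory(sidecar) {
  const cc = sidecar.content_classification ?? {};

  // Prefer the v2.0 IAB Content Taxonomy 3.0 category when present.
  const v3 = cc.iab_v3?.primary_category;
  if (v3) {
    return { taxonomy: 'iab_v3', id: v3.unique_id, label: v3.label };
  }

  // Fall back to the legacy v1 ID, looking up its name in iab_categories.
  if (cc.iab_primary) {
    const legacy = (cc.iab_categories ?? []).find((c) => c.id === cc.iab_primary);
    return { taxonomy: 'iab_v1', id: cc.iab_primary, label: legacy?.name ?? null };
  }
  return null;
}
```

Tagging the result with the taxonomy it came from keeps legacy "IAB19"-style IDs from being mistaken for Taxonomy 3.0 unique IDs downstream.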

named_entities (v2.0 inferential)

named_entities (array): People, orgs, places, products, brands detected across transcript + on-screen text.
named_entities[].name (string): Entity name.
named_entities[].type (string): "person" | "organisation" | "place" | "product" | "brand" | "film" | "event" | "concept".
named_entities[].mentions (array): Each: timestamp_s, timestamp_ms, source (on_screen_text | transcript | visual).
named_entities[].relevance_score (float): 0–1 importance to the video.
named_entities[].wikipedia_likely (boolean): Whether the entity likely has a Wikipedia page (for entity linking).

search_signals (v2.0 inferential)

search_signals.search_intent_keywords (string[]): Queries this video would satisfy, for SEO / internal search.
search_signals.geo_relevance (string[]): Regions/markets the content is relevant to.
search_signals.primary_language (string | null): Primary language code, e.g. "en".
search_signals.secondary_languages (string[]): Other languages detected.

clip_suggestions

clip_suggestions (array): Short-form candidates ranked by virality potential.
clip_suggestions[].index (integer): Zero-based rank.
clip_suggestions[].start_s (float): Clip start in seconds.
clip_suggestions[].end_s (float): Clip end in seconds.
clip_suggestions[].start_ms (integer): Clip start in milliseconds.
clip_suggestions[].end_ms (integer): Clip end in milliseconds.
clip_suggestions[].duration_s (float): Clip duration in seconds.
clip_suggestions[].title (string): Suggested clip title.
clip_suggestions[].hook_text (string): Opening line / hook for this clip.
clip_suggestions[].reason (string): Why this segment was suggested.
clip_suggestions[].virality_score (float): 0–1 predicted virality score.
clip_suggestions[].short_form_suitable (boolean): Duration under 60s and self-contained.
clip_suggestions[].vertical_crop_viable (boolean): Face centred, so safe to crop 16:9 to 9:16 for Reels/Shorts/TikTok.
clip_suggestions[].family_safe (boolean): Clip content is family-safe.
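A typical consumer of clip_suggestions filters for clips that can go straight to a vertical short-form feed. The sketch below shows one such filter; shortFormClips, the minVirality default of 0.6, and the requireFamilySafe option are illustrative assumptions rather than recommended values.

```javascript
// Hypothetical helper: pick clips that are ready for vertical short-form export.
function shortFormClips(sidecar, { minVirality = 0.6, requireFamilySafe = false } = {}) {
  return (sidecar.clip_suggestions ?? [])
    .filter((clip) =>
      clip.short_form_suitable &&
      clip.vertical_crop_viable &&
      clip.virality_score >= minVirality &&
      (!requireFamilySafe || clip.family_safe))
    // Keep only what an export job needs: a title and millisecond cut points.
    .map((clip) => ({ title: clip.title, start_ms: clip.start_ms, end_ms: clip.end_ms }));
}
```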

thumbnail_candidates

thumbnail_candidates (array): Frames ranked by thumbnail_score from Pass 1 keyframe analysis.
thumbnail_candidates[].rank (integer): 1-based rank (1 = best).
thumbnail_candidates[].timestamp_s (float): Frame timestamp in seconds.
thumbnail_candidates[].timestamp_ms (integer): Frame timestamp in milliseconds.
thumbnail_candidates[].score (float): Thumbnail quality score 0–1.
thumbnail_candidates[].reason (string): Why this frame scores well.
thumbnail_candidates[].frame_object_id (string): ZiB object ID of the extracted JPEG. Use getCdnUrl() to get the public URL.
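Resolving the best thumbnail to a URL might look like the sketch below. It assumes the zib client from the Quick Start, whose getCdnUrl is called synchronously there; bestThumbnailUrl itself is a hypothetical helper. It selects by rank rather than array position, in case the array is not already sorted.

```javascript
// Hypothetical helper: CDN URL of the top-ranked thumbnail, or null if none.
// `zib` is the client from the Quick Start (getCdnUrl used as shown there).
function bestThumbnailUrl(zib, sidecar) {
  const candidates = sidecar.thumbnail_candidates ?? [];
  if (candidates.length === 0) return null;
  // rank is 1-based and 1 is best, so pick the smallest rank.
  const best = candidates.reduce((a, b) => (a.rank <= b.rank ? a : b));
  return zib.getCdnUrl(best.frame_object_id);
}
```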

ad_targeting — commercial intelligence (v2.0)

Top-level block for monetisation and contextual ad targeting. Present when the inferential pass runs; may be null on older v1.0 sidecars.

ad_targeting.contextual_keywords (string[]): Keywords for contextual ad matching.
ad_targeting.brand_affinity (string[]): Brands/verticals that align with the content.
ad_targeting.competitor_exclusions (string[]): Brands to exclude (e.g. a competitor is featured).
ad_targeting.ad_moment_candidates (array): Natural break points for ad insertion. Each entry below:
ad_moment_candidates[].timestamp_s / _ms (number): Break point time (seconds + milliseconds).
ad_moment_candidates[].type (string | null): "scene_break" | "topic_transition" | "pause" | "chapter_start".
ad_moment_candidates[].score (float): 0–1 quality of this break for ad insertion.
ad_moment_candidates[].preceding_context / following_context (string | null): Short context around the break.
ad_targeting.cpm_floor_suggestion (object | null): Suggested price floor: floor_usd (float), confidence (0–1), rationale (string).
ad_targeting.suitable_for_ads (boolean | null): Overall ad-suitability verdict (mirrors GARM gate).
ad_targeting.recommended_ad_categories (string[]): Ad verticals that fit the content.
ad_targeting.monetisation_notes (string | null): Free-text notes on monetisation (sponsorships, competitor mentions).
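To turn ad_moment_candidates into a mid-roll schedule, one simple approach is to take the highest-scoring breaks greedily while enforcing a minimum gap between them. The sketch below does that; pickAdBreaks, the minScore default of 0.7, and the two-minute minGapMs default are illustrative assumptions, not values recommended by this page.

```javascript
// Hypothetical helper: choose ad breaks by score, keeping them at least
// minGapMs apart. Returns the chosen candidates in playback order.
function pickAdBreaks(sidecar, { minScore = 0.7, minGapMs = 120000 } = {}) {
  const candidates = sidecar.ad_targeting?.ad_moment_candidates ?? [];

  // Best-scoring candidates first, so they win any spacing conflict.
  const byScore = candidates
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score);

  const chosen = [];
  for (const candidate of byScore) {
    const farEnough = chosen.every(
      (kept) => Math.abs(kept.timestamp_ms - candidate.timestamp_ms) >= minGapMs);
    if (farEnough) chosen.push(candidate);
  }
  return chosen.sort((a, b) => a.timestamp_ms - b.timestamp_ms);
}
```

The integer timestamp_ms field is used for the spacing arithmetic to avoid floating-point comparisons on timestamp_s.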

ai_retrieval — RAG / agent retrieval (v2.0)

ai_retrieval.summary_short (string | null): One- to two-sentence summary for embeddings / agent context.
ai_retrieval.summary_structured (string[]): Bullet-point summary lines.
ai_retrieval.qa_pairs (array): Generated Q&A pairs. Each: question, answer, timestamp_s (where answered).
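For a RAG pipeline, the ai_retrieval block can be flattened into small text documents ready for embedding. The sketch below is one possible shape: toRetrievalDocs and the kind / text / timestamp_s / video_object_id document fields are hypothetical, chosen for illustration rather than prescribed by the schema.

```javascript
// Hypothetical helper: flatten ai_retrieval into documents for an embedding store.
function toRetrievalDocs(sidecar) {
  const retrieval = sidecar.ai_retrieval;
  if (!retrieval) return []; // block is absent on older sidecars

  const docs = [];
  if (retrieval.summary_short) {
    docs.push({ kind: 'summary', text: retrieval.summary_short, timestamp_s: null });
  }
  for (const line of retrieval.summary_structured ?? []) {
    docs.push({ kind: 'bullet', text: line, timestamp_s: null });
  }
  for (const qa of retrieval.qa_pairs ?? []) {
    docs.push({
      kind: 'qa',
      text: `Q: ${qa.question}\nA: ${qa.answer}`,
      timestamp_s: qa.timestamp_s ?? null, // where the answer occurs in the video
    });
  }
  // Attach the source video ID so search hits can link back to the video.
  return docs.map((doc) => ({ ...doc, video_object_id: sidecar.video_object_id }));
}
```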

Full ZibSidecar example

json
{
  "zib_sidecar_version": "2.0",
  "video_object_id": "0c8e3765-9b2b-4a29-82cb-2f1911fe750c",
  "generated_at": "2026-05-25T10:00:00Z",
  "processing_time_ms": 4821,
  "models_used": {
    "transcription": "recommended",
    "vision": "standard",
    "inference": "standard"
  },

  "video": {
    "duration_s": 612.4,
    "duration_ms": 612400,
    "width": 1920,
    "height": 1080,
    "aspect_ratio": "16:9",
    "fps": 29.97,
    "codec": "h264",
    "audio_channels": 2,
    "audio_sample_rate": 44100,
    "file_size_bytes": 284720384,
    "bitrate_kbps": 3720
  },

  "transcript": {
    "language": "en",
    "language_confidence": 0.98,
    "full_text": "Welcome back to the channel...",
    "word_count": 1842,
    "speaking_pace_wpm": 148,
    "silence_ratio": 0.12,
    "multiple_speakers": false,
    "segments": [
      {
        "index": 0,
        "start_s": 0.0,   "end_s": 4.2,
        "start_ms": 0,    "end_ms": 4200,
        "text": "Welcome back to the channel.",
        "speaker": "speaker_1",
        "confidence": 0.97,
        "intent": "hook",
        "viewer_engagement_prediction": "high",
        "is_hook_moment": true,
        "is_payoff_moment": false,
        "cta_detected": false,
        "cta_text": null,
        "question_posed": null,
        "key_claim": null
      }
    ]
  },

  "subtitles": {
    "language": "en",
    "srt_object_id": "zib_abc123.en.srt",
    "vtt_object_id": "zib_abc123.en.vtt",
    "srt_inline": "1\n00:00:00,000 --> 00:00:04,200\nWelcome back to the channel.\n\n...",
    "vtt_inline": "WEBVTT\n\n00:00:00.000 --> 00:00:04.200\nWelcome back to the channel.\n\n..."
  },

  "chapters": [
    {
      "index": 0,
      "title": "Introduction",
      "summary": "Host explains privacy limitations of centralised cloud storage.",
      "start_s": 0.0,   "end_s": 142.0,
      "start_ms": 0,    "end_ms": 142000,
      "keywords": ["introduction", "decentralised storage", "privacy"],
      "dominant_intent": "hook",
      "engagement_level": "high"
    }
  ],
  "chapters_youtube_format": "0:00 Introduction\n2:22 How ZiB Works\n6:20 Getting Started",

  "scene_understanding": {
    "executive_summary": "A single presenter in a professional home office...",
    "title_suggestion": "ZiB Network: Private Decentralised Storage Explained",
    "description_suggestion": "Learn how ZiB Network solves the privacy problems of traditional cloud...",
    "content_format": "tutorial",
    "narrative_arc": "problem_solution",
    "target_audience": "Intermediate to advanced software developers.",
    "assumed_knowledge_level": "intermediate",
    "unique_value": "Hands-on technical explanation with concrete failure demonstrations.",
    "series_indicators": "References 'last week's video on encryption' at 2:14.",
    "guest_present": false,
    "monetisation_notes": "No sponsored segments. Strong CTA at 9:45.",

    "intent": {
      "viewer_intent": "research",
      "content_intent": "tutorial",
      "call_to_action": "learn_more",
      "emotional_tone": "inspiring",
      "controversy_score": 0.08,
      "controversy_reason": null
    },
    "audience": {
      "description": "Software developers evaluating decentralised storage.",
      "psychographic_tags": ["early_adopter", "privacy_conscious", "technical"],
      "age_skew": "25-44",
      "gender_skew": "skews_male",
      "affluence_signal": "mid",
      "knowledge_level": "intermediate",
      "viewing_context": "active_research"
    },
    "topic_segments": [
      { "start_s": 0.0, "end_s": 142.0, "start_ms": 0, "end_ms": 142000,
        "topic": "Cloud privacy problem", "summary": "Why centralised cloud can read your data.",
        "keywords": ["privacy", "cloud"], "ad_suitable": false }
    ],

    "production": {
      "quality": "prosumer",
      "setting": "home_office",
      "cut_style": "moderate",
      "talking_head_ratio": 0.78,
      "has_screen_recording": true,
      "has_broll": false,
      "has_slides": false,
      "has_lower_thirds": true,
      "has_captions_burned_in": false,
      "has_graphic_overlays": true,
      "presenter_count": 1,
      "consistent_framing": true,
      "consistent_lighting": true
    },

    "engagement_map": {
      "high_engagement_segments": [
        { "start_s": 0.0, "end_s": 45.0, "start_ms": 0, "end_ms": 45000,
          "reason": "Strong hook, high presenter energy" }
      ],
      "low_engagement_risk_segments": [
        { "start_s": 95.0, "end_s": 142.0, "start_ms": 95000, "end_ms": 142000,
          "reason": "Slow transition, background context" }
      ],
      "hook_moments": [
        { "timestamp_s": 8.4, "timestamp_ms": 8400,
          "reason": "Provocative claim: most cloud storage can be read by the provider" }
      ],
      "payoff_moments": [
        { "timestamp_s": 187.0, "timestamp_ms": 187000,
          "reason": "Live demonstration of node failure recovery" }
      ],
      "cta_moments": [
        { "timestamp_s": 585.0, "timestamp_ms": 585000,
          "text": "Link to GitHub in description", "type": "link" }
      ]
    },

    "on_screen_text": [
      { "timestamp_s": 12.0, "timestamp_ms": 12000,
        "text": "How ZiB Storage Works", "type": "title_card" }
    ],

    "keyframes": [
      {
        "timestamp_s": 0.0, "timestamp_ms": 0,
        "description": "Person seated at standing desk, looking directly at camera.",
        "setting": "home_office",
        "presenter_emotion": "enthusiastic",
        "body_language": "Forward lean, open hand gestures, direct eye contact",
        "engagement_level": "high",
        "on_screen_text": [],
        "objects": ["person", "desk", "monitor", "microphone"],
        "brands_visible": [],
        "production_markers": [],
        "lighting_quality": "good",
        "framing_quality": "excellent",
        "face_visible": true,
        "face_centered": true,
        "thumbnail_score": 0.91
      }
    ]
  },

  "content_classification": {
    "iab_categories": [
      { "id": "IAB19", "name": "Technology & Computing", "confidence": 0.97 },
      { "id": "IAB19-18", "name": "Internet Technology", "confidence": 0.91 }
    ],
    "iab_primary": "IAB19",
    "iab_v3": {
      "version": "3.0",
      "categories": [
        { "unique_id": "596", "label": "Technology & Computing", "tier": 1, "confidence": 0.97 },
        { "unique_id": "618", "label": "Information and Network Security", "tier": 2, "confidence": 0.88 }
      ],
      "primary_category": { "unique_id": "596", "label": "Technology & Computing", "tier": 1, "confidence": 0.97 },
      "descriptive_vectors": {
        "content_type": "video", "content_channel": "streaming", "data_source": "ai_classification",
        "genre": "educational", "language": "en", "live": false
      }
    },
    "named_entities": [
      { "name": "Amazon S3", "type": "product",
        "mentions": [{ "timestamp_s": 18.0, "timestamp_ms": 18000, "source": "transcript" }],
        "relevance_score": 0.62, "wikipedia_likely": true }
    ],
    "search_signals": {
      "search_intent_keywords": ["private cloud storage", "encrypted file storage"],
      "geo_relevance": ["global"],
      "primary_language": "en",
      "secondary_languages": []
    },
    "tags": ["decentralised storage", "ZiB", "encryption", "privacy", "Web3"],
    "keywords_ranked": [
      { "keyword": "decentralised storage", "relevance": 0.96 }
    ],
    "primary_language": "en",
    "contains_music": false,
    "contains_speech": true,
    "speech_clarity": "high",
    "accessibility_score": 0.87,
    "content_rating": "general",
    "suitable_for_children": false,
    "suitable_for_families": false,
    "garm": {
      "overall_floor": "floor_1",
      "suitable_for_ads": true,
      "brand_safety_score": 96,
      "categories": {
        "adult_explicit_sexual_content": { "present": false, "floor": null },
        "arms_ammunition": { "present": false, "floor": null },
        "crime_harmful_acts": { "present": false, "floor": null },
        "death_injury_military_conflict": { "present": false, "floor": null },
        "online_piracy": { "present": false, "floor": null },
        "hate_speech_acts_of_aggression": { "present": false, "floor": null },
        "obscenity_profanity": { "present": false, "floor": null },
        "illegal_drugs_tobacco_vaping": { "present": false, "floor": null },
        "spam_harmful_content": { "present": false, "floor": null },
        "terrorism": { "present": false, "floor": null },
        "debated_sensitive_social_issues": { "present": false, "floor": null }
      }
    },
    "rights": {
      "content_type": "original",
      "contains_third_party_ip": false,
      "detected_brands": [],
      "detected_music": [],
      "detected_stock_footage": false,
      "copyright_flags": []
    }
  },

  "clip_suggestions": [
    {
      "index": 0,
      "start_s": 8.4,   "end_s": 55.2,
      "start_ms": 8400, "end_ms": 55200,
      "duration_s": 46.8,
      "title": "What if your cloud storage provider can read everything you upload?",
      "hook_text": "Most people don't realise that every file you upload to S3 can be read by Amazon.",
      "reason": "Opens with provocative hook, delivers clear explanation, natural conclusion.",
      "virality_score": 0.81,
      "short_form_suitable": true,
      "vertical_crop_viable": true,
      "family_safe": true
    }
  ],

  "thumbnail_candidates": [
    {
      "rank": 1,
      "timestamp_s": 12.4, "timestamp_ms": 12400,
      "score": 0.91,
      "reason": "Clear face, direct eye contact, forward lean, good lighting.",
      "frame_object_id": "zib_abc123.thumb.0001.jpg"
    }
  ],

  "ad_targeting": {
    "contextual_keywords": ["cloud storage", "data privacy", "developer tools"],
    "brand_affinity": ["password managers", "VPNs", "developer SaaS"],
    "competitor_exclusions": ["Amazon Web Services"],
    "ad_moment_candidates": [
      { "timestamp_s": 142.0, "timestamp_ms": 142000, "type": "chapter_start",
        "score": 0.82, "preceding_context": "Intro wraps up",
        "following_context": "How ZiB works begins" }
    ],
    "cpm_floor_suggestion": { "floor_usd": 18.5, "confidence": 0.7,
      "rationale": "Affluent technical audience, brand-safe, high intent." },
    "suitable_for_ads": true,
    "recommended_ad_categories": ["B2B SaaS", "Developer Tools", "Cybersecurity"],
    "monetisation_notes": "Strong fit for developer-tool sponsorships. No competitor conflicts beyond AWS."
  },

  "ai_retrieval": {
    "summary_short": "A developer-focused tutorial explaining how ZiB Network provides private, decentralised storage.",
    "summary_structured": [
      "Problem: centralised cloud providers can read uploaded data.",
      "Solution: client-side encryption + erasure-coded decentralised storage.",
      "Demo: live node-failure recovery."
    ],
    "qa_pairs": [
      { "question": "Can cloud providers read my files?", "answer": "On most centralised providers, yes — ZiB encrypts client-side first.", "timestamp_s": 18.0 }
    ]
  }
}

Webhooks

Instead of polling, register a webhook_url (and optional webhook_secret) on the job request. ZiB POSTs a JSON event to your endpoint as the job progresses. When vision finishes you receive a vision.complete event carrying the sidecar_file_id — fetch the sidecar from the CDN exactly as in Quick Start.

Registering the webhook

Standalone vision job — pass the webhook fields in the compute request body:

http
POST /v1/api/compute/{objectId}
Authorization: Bearer ACCESS_KEY:SECRET_KEY
Content-Type: application/json

{
  "vision": "standard",
  "webhook_url": "https://your-app.example.com/hooks/zib",
  "webhook_secret": "whsec_your_shared_secret"
}

Combined pipeline: request encode + vision together; the same endpoint receives both the encoding.* lifecycle events and vision.complete:

http
POST /v1/api/encoding/request
Authorization: Bearer ACCESS_KEY:SECRET_KEY
Content-Type: application/json

{
  "file_id": "{objectId}",
  "vision": "hq",
  "webhook_url": "https://your-app.example.com/hooks/zib",
  "webhook_secret": "whsec_your_shared_secret"
}

You can also set a webhook at upload time via the pipeline object on POST /v1/api/upload/initiate ({ encode: true, vision: "standard", webhook_url, webhook_secret }), or save a default webhook_secret on your customer profile so you can omit it per request.
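
As a rough sketch, the standalone request above could be sent from Node 18+ using the built-in fetch. This section does not state the API origin, so apiBase is a placeholder you must supply. buildVisionJobRequest and submitVisionJob are hypothetical helper names, not part of a ZiB SDK.

```javascript
// Build the standalone compute request shown above.
// apiBase is a placeholder: this page does not state the API origin.
function buildVisionJobRequest(apiBase, objectId, creds, hook) {
  const body = { vision: hook.tier || 'standard', webhook_url: hook.url };
  // webhook_secret is optional; omit it to rely on a profile default, if set
  if (hook.secret) body.webhook_secret = hook.secret;
  return {
    url: `${apiBase}/v1/api/compute/${encodeURIComponent(objectId)}`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${creds.accessKey}:${creds.secretKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

// Send it with the global fetch available in Node 18+ and modern browsers.
async function submitVisionJob(apiBase, objectId, creds, hook) {
  const { url, init } = buildVisionJobRequest(apiBase, objectId, creds, hook);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`compute request failed: HTTP ${res.status}`);
  return res.json(); // response shape is not documented in this section
}
```

Keeping the request builder separate from the network call makes it easy to unit-test the body and headers without contacting the API.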

Events

Event                   Source    Description
vision.complete         compute   Vision finished. Payload includes sidecar_file_id. This is the one to act on for Vision AI.
transcription.complete  compute   Transcription finished (if requested). Payload includes srt_file_id / vtt_file_id.
encoding.queued         encoding  Combined pipeline only: encode job accepted.
encoding.started        encoding  Combined pipeline only: encoding began. Carries vision_model.
encoding.sharding       encoding  Combined pipeline only: segments being distributed.
encoding.complete       encoding  Combined pipeline only: playback ready (hls_manifest_url). Vision still runs after; wait for vision.complete.
encoding.failed         encoding  Combined pipeline only: encode failed (error_message).
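
One way to route the events listed above on a shared endpoint is sketched below. nextAction and the action labels it returns are illustrative names only, not part of the ZiB API; the payload fields it reads are the ones named in the event descriptions.

```javascript
// Decide what to do with a webhook event on an endpoint that receives both
// compute and encoding events. The action labels are hypothetical.
function nextAction(event) {
  switch (event.event) {
    case 'vision.complete':
      return { action: 'fetch_sidecar', id: event.sidecar_file_id };
    case 'transcription.complete':
      return { action: 'fetch_subtitles', srt: event.srt_file_id, vtt: event.vtt_file_id };
    case 'encoding.complete':
      // Playback is ready, but vision still runs afterwards.
      return { action: 'enable_playback', url: event.hls_manifest_url };
    case 'encoding.failed':
      return { action: 'report_failure', message: event.error_message };
    case 'encoding.queued':
    case 'encoding.started':
    case 'encoding.sharding':
      return { action: 'update_progress', stage: event.event };
    default:
      return { action: 'ignore' }; // unknown event: ACK and move on
  }
}
```

Returning a plain description of the action, rather than doing the work inline, keeps the webhook handler fast enough to ACK promptly.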

vision.complete payload

json
{
  "event": "vision.complete",
  "job_id": "47af4a5c-f95b-426e-90fc-4ec184bf622d",
  "file_id": "0c8e3765-9b2b-4a29-82cb-2f1911fe750c",
  "sidecar_file_id": "b1d2…",
  "timestamp": "2026-05-25T10:00:00Z"
}

Delivered with these headers:

Header           Description
X-ZiB-Event      The event name, e.g. "vision.complete".
X-ZiB-Job-ID     The job_id.
X-ZiB-Signature  HMAC-SHA256 signature as "sha256=<hex>" (only sent when a non-empty webhook_secret is set).

Verifying the signature

The signature is HMAC-SHA256(webhook_secret, raw_request_body), hex-encoded, prefixed with sha256=. Compute it over the raw bytes of the body (before JSON parsing) and compare with a constant-time check.

javascript
import crypto from 'crypto';
import express from 'express';

const app = express();

// Express example: capture the RAW body for this route
app.post('/hooks/zib', express.raw({ type: 'application/json' }), (req, res) => {
  const sig = Buffer.from(req.header('X-ZiB-Signature') || '', 'utf8');
  const expected = Buffer.from('sha256=' + crypto
    .createHmac('sha256', process.env.ZIB_WEBHOOK_SECRET)
    .update(req.body)                       // req.body is a Buffer (raw bytes)
    .digest('hex'), 'utf8');

  // timingSafeEqual throws if byte lengths differ, so compare lengths first
  const ok = sig.length === expected.length && crypto.timingSafeEqual(sig, expected);
  if (!ok) return res.status(401).send('bad signature');

  const event = JSON.parse(req.body.toString('utf8'));
  if (event.event === 'vision.complete') {
    const sidecarUrl = 'https://cdn.zibnetwork.com/objects/' + event.sidecar_file_id;
    enqueueSidecarFetch(event.file_id, sidecarUrl);   // your own queue helper: fetch + JSON.parse async
  }
  res.sendStatus(200);   // ACK quickly; do work out of band
});

Respond 2xx quickly. ZiB retries failed deliveries up to 3 times with 1s / 2s / 4s backoff, then gives up, so do the sidecar fetch and any heavy work asynchronously after you ACK. Deliveries are best-effort; keep polling GET /v1/api/compute/job/:job_id as a fallback for jobs you never hear back on.
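
Because a retry can plausibly arrive after your handler has already processed an earlier delivery (for example, when your 2xx response was slow), it may help to make event handling idempotent. The sketch below keeps an in-memory set keyed by event name and job_id. createDeduper is a hypothetical helper, and a real deployment would likely use a shared store rather than process memory.

```javascript
// Returns a function that reports whether an event is being seen for the
// first time. Keyed on event name + job_id, both present in the payload.
function createDeduper() {
  const seen = new Set();
  return function isFirstDelivery(event) {
    const key = `${event.event}:${event.job_id}`;
    if (seen.has(key)) return false; // duplicate delivery: ACK, but skip the work
    seen.add(key);
    return true;
  };
}

// Usage inside a webhook handler (after signature verification):
//   const isFirstDelivery = createDeduper();
//   if (isFirstDelivery(event)) { /* enqueue the sidecar fetch */ }
```

A duplicate should still be answered with 2xx so that the sender stops retrying.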

All Timestamps Have Two Formats

Every timed field in the ZibSidecar appears in both _s (float, seconds) and _ms (integer, milliseconds), so consumers do not need to convert units. Use whichever format matches your platform: video players typically want seconds, while ad servers and caption systems typically want milliseconds.

javascript
// Both are always present — pick the unit that matches your system
const segment = sidecar.transcript.segments[0];
segment.start_s   // => 0.0        (float seconds — for video players)
segment.start_ms  // => 0          (integer ms — for ad servers, caption pipelines)

const chapter = sidecar.chapters[0];
chapter.start_s   // => 142.0
chapter.start_ms  // => 142000

const clip = sidecar.clip_suggestions[0];
clip.start_ms     // => 8400       (integer — no floating point rounding to worry about)

const thumb = sidecar.thumbnail_candidates[0];
thumb.timestamp_ms // => 12400     (seek video player to this ms)

Consumer Guide

How downstream platforms map from the ZibSidecar to their own ingest formats. The sidecar is designed to be rich enough that any consumer can extract what they need without a translation layer in ZiB.

RiteStream / Video platforms

javascript
const { scene_understanding, content_classification, chapters } = sidecar;

// Auto-populate upload form
form.title       = scene_understanding.title_suggestion;
form.description = scene_understanding.description_suggestion;
form.tags        = content_classification.tags;

// Paste directly into YouTube description field
form.chapters    = sidecar.chapters_youtube_format;

// Ad placement — brand safety gate
const brandSafe = content_classification.garm.brand_safety_score > 80;
const adFloor   = content_classification.garm.overall_floor; // "floor_1"

// Place mid-roll ad at a natural viewing lull
const adMidrollAt = sidecar.scene_understanding.engagement_map
  .low_engagement_risk_segments[0]?.start_ms;

// Search indexing
const searchDoc = {
  id:          fileId,
  title:       scene_understanding.title_suggestion,
  body:        scene_understanding.executive_summary,
  transcript:  sidecar.transcript.full_text,
  tags:        content_classification.tags,
  iab:         content_classification.iab_primary,
};

Content licensing / AllRites

javascript
// Caption delivery — segments have start_ms and end_ms ready-to-use
const captions = sidecar.transcript.segments.map(s => ({
  start: s.start_ms,
  end:   s.end_ms,
  text:  s.text,
}));

// SRT available inline — no second fetch needed
const srtContent = sidecar.subtitles.srt_inline;
const vttContent = sidecar.subtitles.vtt_inline;

// Content moderation and ad eligibility
const adEligible  = sidecar.content_classification.garm.suitable_for_ads;
const iabCategory = sidecar.content_classification.iab_primary; // e.g. "IAB19"

// Rights and IP detection
const { rights } = sidecar.content_classification;
if (rights.contains_third_party_ip || rights.copyright_flags.length > 0) {
  flagForRightsReview(fileId, rights);
}

Short-form / Clips (Reels, Shorts, TikTok)

javascript
// Find clips suitable for Reels/Shorts
const shortFormClips = sidecar.clip_suggestions.filter(c => c.short_form_suitable);

// Further filter to those safe for vertical crop (9:16)
const verticalClips = shortFormClips.filter(c => c.vertical_crop_viable);

// Each clip has everything needed to create a derivative
verticalClips.forEach(clip => {
  console.log(clip.start_ms, clip.end_ms); // trim points in ms
  console.log(clip.title);                 // suggested title
  console.log(clip.hook_text);             // opening caption
  console.log(clip.virality_score);        // 0-1 predicted virality
});

Professional NLE (Premiere, DaVinci, Final Cut)

javascript
// Import sequence markers from chapters
const markers = sidecar.chapters.map(ch => ({
  name: ch.title,
  in:   ch.start_ms,
  out:  ch.end_ms,
}));

// Caption track import — SRT inline, no extra fetch
const captionTrack = sidecar.subtitles.srt_inline;

// Best poster frame — frame_object_id is a ZiB object, get CDN URL
const bestThumb = sidecar.thumbnail_candidates[0];
const thumbUrl  = zib.getCdnUrl(bestThumb.frame_object_id);

// On-screen text for auto-generated lower thirds
sidecar.scene_understanding.on_screen_text.forEach(item => {
  if (item.type === 'chyron' || item.type === 'title_card') {
    addLowerThird(item.timestamp_ms, item.text);
  }
});

Programmatic ad platforms / DSPs

javascript
const { content_classification: cc, ad_targeting: ad } = sidecar;

// IAB Content Taxonomy 3.0 (v2.0) — readable by all major DSPs without translation
const targeting = {
  iab_categories: cc.iab_v3.categories.map(c => c.unique_id), // ["596", "618"]
  iab_primary:    cc.iab_v3.primary_category?.unique_id,       // "596"
  vectors:        cc.iab_v3.descriptive_vectors,
};

// GARM brand safety — same standard used by Snapchat, YouTube, all DSPs
const brandSafety = {
  floor:          cc.garm.overall_floor,  // "floor_1"
  score:          cc.garm.brand_safety_score,
  suitable:       cc.garm.suitable_for_ads,
  categories:     cc.garm.categories,    // 11 GARM categories, each with floor
};

// v2.0 ad_targeting — contextual keywords, price floor, and break points
const bid = {
  keywords:    ad?.contextual_keywords,
  exclude:     ad?.competitor_exclusions,
  floorUsd:    ad?.cpm_floor_suggestion?.floor_usd,
  adBreaks:    (ad?.ad_moment_candidates || []).map(m => m.timestamp_ms),
};

// Engagement-driven ad slot selection (fallback / complement to ad_moment_candidates)
const adSlots = sidecar.scene_understanding.engagement_map
  .low_engagement_risk_segments
  .map(seg => ({ in_ms: seg.start_ms, out_ms: seg.end_ms, reason: seg.reason }));

IAB Content Taxonomy 3.0

ZiB classifies content against the IAB Content Taxonomy 3.0, whose categories are identified by numeric unique IDs such as 596 (Technology & Computing). Codes like IAB19 and IAB19-18 belong to the older IAB category list and appear only in the legacy iab_categories / iab_primary fields. The sidecar outputs real taxonomy IDs, so no translation layer is needed for DSP integration. The full taxonomy is maintained at github.com/InteractiveAdvertisingBureau/Taxonomies.

On v2.0 sidecars, read content_classification.iab_v3: iab_v3.primary_category.unique_id is the single best category and iab_v3.categories[] carries every matched category with its unique_id, label, tier, and confidence. Declare the primary unique_id directly in your DSP content-category field — no mapping required. The legacy iab_categories / iab_primary fields remain for v1.0 back-compat.
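
If you consume both v1.0 and v2.0 sidecars, the branch described above might look like the following sketch. primaryIabCategory and the taxonomy labels it returns are hypothetical names chosen for illustration; only the sidecar field names come from the schema.

```javascript
// Return the primary content category, preferring iab_v3 on v2.0 sidecars
// and falling back to the legacy iab_primary code otherwise.
function primaryIabCategory(sidecar) {
  const cc = sidecar.content_classification || {};
  const v3 = cc.iab_v3 && cc.iab_v3.primary_category;
  if (sidecar.zib_sidecar_version === '2.0' && v3) {
    return { taxonomy: 'iab_content_3.0', id: v3.unique_id, confidence: v3.confidence };
  }
  if (!cc.iab_primary) return null; // nothing classified
  // Legacy path: look up the confidence of the primary code, if listed
  const match = (cc.iab_categories || []).find(c => c.id === cc.iab_primary);
  return {
    taxonomy: 'iab_legacy',
    id: cc.iab_primary,
    confidence: match ? match.confidence : null,
  };
}
```

Tagging the result with the taxonomy it came from lets downstream code avoid mixing numeric 3.0 IDs with legacy codes in the same field.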

GARM Brand Safety

The 11 GARM categories and floor_1–floor_4 floor levels in the sidecar are the GARM Brand Safety Floor & Suitability Framework, used by Snapchat, YouTube, and all major DSPs. The framework is maintained at wfanet.org.

Floor levels

Floor    Meaning             Typical use
floor_1  Minimum acceptable  Most brand advertising. suitable_for_ads: true
floor_2  Low suitability     Restricted categories only (alcohol, gambling)
floor_3  High risk           Not suitable for standard brand campaigns
floor_4  Unsafe              Block entirely. suitable_for_ads: false

brand_safety_score is a composite 0–100 score. A score of 96+ means the content is broadly brand safe across all 11 categories. suitable_for_ads: true means the video passes floor_1 — the industry minimum standard for advertising suitability. Each of the 11 categories also carries an individual floor level, so ad platforms can apply their own category-level exclusion rules independently of the overall score.
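
A category-level exclusion rule of the kind described above could be sketched as follows. It assumes the floor labels order from floor_1 (least risky) to floor_4 (most risky), as the floor table suggests. passesGarmPolicy and the policy object shape (maxOverall, maxByCategory) are hypothetical, not part of the sidecar schema.

```javascript
// Rank floor labels so they can be compared: higher means riskier content.
const FLOOR_RANK = { floor_1: 1, floor_2: 2, floor_3: 3, floor_4: 4 };

function rankOf(floor) {
  const rank = FLOOR_RANK[floor];
  if (rank === undefined) throw new Error(`unknown floor label: ${floor}`);
  return rank;
}

// policy.maxOverall: riskiest floor the advertiser accepts overall.
// policy.maxByCategory: optional per-category caps keyed by GARM category name.
function passesGarmPolicy(garm, policy) {
  if (!garm.suitable_for_ads) {
    return { ok: false, reason: 'suitable_for_ads is false' };
  }
  if (rankOf(garm.overall_floor) > rankOf(policy.maxOverall)) {
    return { ok: false, reason: `overall ${garm.overall_floor}` };
  }
  const caps = policy.maxByCategory || {};
  for (const [name, cat] of Object.entries(garm.categories)) {
    if (!cat.present || cat.floor === null) continue; // nothing detected
    const cap = caps[name] || policy.maxOverall;
    if (rankOf(cat.floor) > rankOf(cap)) {
      return { ok: false, reason: `${name} at ${cat.floor}` };
    }
  }
  return { ok: true, reason: null };
}
```

For example, an advertiser who tolerates floor_2 content in general but wants no profanity at all could pass { maxOverall: 'floor_2', maxByCategory: { obscenity_profanity: 'floor_1' } }.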