FIND.FRAME.FEED: Orchestrating Agentic AI for Unified Asset Search and Vertical Video
Written by: Gino Bautista | Contributors: Quan Vo, Hau Pham, Hieu Nguyen & James Roe
In the fast-paced world of Media & Entertainment, organizations are under immense pressure to instantly adapt horizontal footage for mobile streaming and vertical social feeds like TikTok and Instagram Reels. Simultaneously, the fragmentation of assets across disparate storage locations and the need to log a multitude of live feeds are overwhelming content operations. These video assets often lack consistent metadata taxonomies and are indexed using incompatible embedding models, creating significant bottlenecks in the media supply chain for capturing and monetizing viral moments.
To address these challenges, Wizeline partnered with TwelveLabs to incorporate their latest research on multi-modal video search, along with the newest advancements in agentic AI, into Wizeline’s plug-in-based AI Media Accelerator framework, known as Wize Media Suite, automating the live/archive-to-vertical-video workflow.
This orchestration leverages a “swarm of agents” built on TwelveLabs’ Pegasus and Marengo, the Anthropic Claude family of models, Amazon Titan or Nova image and text embeddings, open-source computer vision models and Amazon Bedrock AgentCore to create a unified, intent-based search and vertical video pipeline. Bedrock AgentCore, combined with the Strands agent framework, serves as the orchestration layer that manages the complexity of multi-modal metadata enrichment, search, and deep video understanding.
Integrated with an Agentic AI assistant that can recommend potential viral clips, create storyboards and assemble reels with animated captioning, this accelerator collapses the time to in-platform vertical content engagement from days to minutes.

Sample Reference Architecture of Unified Agentic Search
Near-Live Highlights & Metadata with Pegasus
One of the newest innovations of the accelerator is the integration of TwelveLabs Pegasus for agentic understanding of live broadcasts. The accelerator’s existing chapter detection and metadata enrichment plug-ins required the complex coordination of diverse AI services, such as Anthropic’s Claude and several AWS services (Transcribe, SageMaker, Rekognition and Comprehend). As an alternative to those plug-ins, the system now achieves unified video understanding, scene identification, and clip segmentation with a single foundation model, TwelveLabs Pegasus.
Advanced Metadata Generation
Pegasus analyzes visual and auditory signals directly by processing live Transport Stream (TS) files converted into discrete video segments. To maintain accuracy during chapter finalization, the implementation uses a 5-minute latency buffer when processing live feeds into chapters and metadata. Even with that buffer, many marketing teams can post live, broadcast-quality footage to social media more quickly and cost-effectively than before. This specialized Pegasus plug-in extracts four essential metadata categories that go beyond standard timestamping:
- On-Screen Text: Captures verbatim lower-third and chyron data
- Visual Descriptions: Provides scene-level prose describing actions independent of audio
- Segment Types: Classifies broadcast formats, such as sports or news
- Logos: Identifies brand or network identifiers automatically
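As an illustration, the four categories above could be carried downstream in a record like the one sketched here. This is a minimal sketch, not the TwelveLabs response schema: the class name, field names and the `is_finalizable` helper are hypothetical, and the helper encodes only one possible reading of the 5-minute buffer (a chapter is finalized once the live edge is a full buffer past its end).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

BUFFER_SEC = 5 * 60  # the 5-minute latency buffer described above


@dataclass
class ChapterMetadata:
    """Hypothetical record for one chapter of a live feed."""
    start_sec: float
    end_sec: float
    on_screen_text: List[str] = field(default_factory=list)  # lower-thirds, chyrons
    visual_description: str = ""                              # scene-level prose
    segment_type: str = "unknown"                             # e.g. "sports", "news"
    logos: List[str] = field(default_factory=list)            # brand or network names

    def to_index_document(self) -> str:
        """Serialize to JSON so the record can be fed into a search index."""
        return json.dumps(asdict(self), sort_keys=True)


def is_finalizable(chapter_end_sec: float, live_edge_sec: float) -> bool:
    """Assumed rule: finalize once the live edge is a full buffer past the chapter end."""
    return live_edge_sec - chapter_end_sec >= BUFFER_SEC
```

Keeping each category in its own field lets the search layer filter or weight on, say, logos without parsing free-form text.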
Architectural Simplicity and Latency Trade-offs
TwelveLabs’ Pegasus plug-in offers a clear operational benefit over some of Wizeline’s other AI media plug-ins: it is simpler to deploy and maintain.
Alternative metadata generation and embedding generation plug-ins within the accelerator perform frame-level multi-modal video analysis using Anthropic Claude models in conjunction with Amazon Transcribe, Rekognition, and Comprehend, Titan or Nova embeddings, or even AWS Elemental Inference. Depending on the configuration, this enables a time-to-segment that can be closer to near-real time.
The trade-off is that these plug-ins are more complex to deploy and maintain as they require multiple AI services to be orchestrated to achieve the same metadata generation that Pegasus can deliver as a single foundation model.
The metadata generated from these plug-ins is fed back into our search indexing, allowing the AI-assistant to choose between using Pegasus-enriched data, Claude-enriched metadata or both, in order to synthesize the most accurate answers.
The Search Engine: Multi-Vector Optimization with Marengo
Video is a synchronized bundle of visual action, non-speech audio and spoken transcription. TwelveLabs Marengo allows us to treat these as separable, inspectable channels.
Indexing: Fused vs. Separate Fields
Wizeline leveraged research from the TwelveLabs whitepaper, “A Guidance on Multi-Vector Video Search with TwelveLabs Marengo,” which discusses Fused Embeddings that collapse modalities into a single vector during ingestion. While simple, this approach creates irreversible bias and limits debuggability.

Our Strategy: To balance cost and performance, we implemented the Separate Fields approach. By isolating Visual, Audio and Transcription data into dedicated embedding channels, we preserve modality-specific signals for fine-grained tuning and weighted retrieval, while optimizing the storage footprint compared to the Fused embeddings approach.
Querying: Intent-Based Routing
To determine which modality should dominate a search, we tested several methods:
- LLM Decomposition: Using an LLM to split a query like “basketball player dunking with beats playing” into visual (“dunking”) and audio (“beats”) components. This can be noisy and non-deterministic.
- Intent-based Dynamic Query Routing: We construct modality “anchors”—textual descriptions of intent (e.g., “This document contains content about spoken words” for transcriptions).
- The Winner: Intent-based Dynamic Query Routing proved superior for enterprise reliability. It is deterministic, explainable and configurable without retraining models, providing an audit trail for why a specific result was ranked.
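One way to picture anchor-based routing is to embed the query and each anchor in the same space, compare them with cosine similarity, and normalize the similarities into modality weights. The sketch below assumes that approach; it is not taken from the whitepaper. The `embed` callable stands in for whatever text embedding model is used, the visual and audio anchor wording is hypothetical (only the transcription anchor is quoted from the text above), and the temperature value is an arbitrary illustration.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]

# The transcription anchor is quoted from the article; the other two are
# hypothetical wording in the same style.
ANCHORS: Dict[str, str] = {
    "visual": "This document contains content about visual scenes and actions",
    "audio": "This document contains content about non-speech sounds and music",
    "transcription": "This document contains content about spoken words",
}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query: str, embed: Callable[[str], Vector],
          temperature: float = 0.1) -> Dict[str, float]:
    """Return modality weights that sum to 1, via a softmax over anchor similarities."""
    q = embed(query)
    sims = {m: cosine(q, embed(text)) for m, text in ANCHORS.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {m: math.exp((s - peak) / temperature) for m, s in sims.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}
```

Because the weights depend only on the embeddings and fixed anchor texts, the same query yields the same routing whenever the embedding model is deterministic, and changing an anchor's wording changes the routing without retraining anything.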
Ranking: Weighted Reciprocal Rank Fusion (RRF)
Executing parallel queries across multiple modalities necessitates a robust strategy for result synthesis. We address this using Weighted Reciprocal Rank Fusion (RRF). Unlike standard RRF, which treats all sources equally, our approach dynamically weights each modality’s contribution based on query intent. This ensures that the final output isn’t just determined by rank position, but by the relevance of the specific medium—visual, audio or transcription—to the user’s goal.
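A weighted form of RRF can be written as score(d) = sum over modalities m of w_m / (k + rank_m(d)), where rank_m(d) is the 1-based position of clip d in modality m's result list. The following is a minimal sketch under that formulation. The function name and clip IDs are hypothetical, and k = 60 is the constant commonly used with standard RRF, not a value stated in this article.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_rrf(rankings: Dict[str, List[str]],
                 weights: Dict[str, float],
                 k: int = 60) -> List[Tuple[str, float]]:
    """Fuse per-modality ranked lists into one list, highest score first."""
    scores: Dict[str, float] = defaultdict(float)
    for modality, ranked_ids in rankings.items():
        w = weights.get(modality, 0.0)  # modalities without a weight contribute nothing
        for rank, clip_id in enumerate(ranked_ids, start=1):
            scores[clip_id] += w / (k + rank)
    # Sort by descending score; break ties by clip ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


rankings = {
    "visual": ["a", "b", "c"],
    "audio": ["c", "a"],
    "transcription": ["b"],
}
visual_heavy = weighted_rrf(rankings, {"visual": 0.7, "audio": 0.2, "transcription": 0.1})
audio_heavy = weighted_rrf(rankings, {"visual": 0.1, "audio": 0.8, "transcription": 0.1})
```

With the visual-heavy weights clip "a" ranks first, while the audio-heavy weights promote clip "c", even though the underlying per-modality rankings are identical.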
The “Find” Orchestration Layer: Bedrock AgentCore and Strands
To manage the complexity of multi-modal search and video understanding, we implemented a highly modular architecture using Amazon Bedrock AgentCore and the Strands framework.
In-Runtime Execution Flow
The core of the system relies on a Dynamic Agent Orchestrator designed for high-concurrency video processing. Instead of a static script, each request triggers the construction of a scoped agent instance, ensuring strict isolation and context-specific execution. These “swarms of agents” can take on different roles such as “search query agent”, “metadata synthesis agent”, “planning agent”, “editor agent”, “review agent”, and more, depending on your workflow.
1. Stateless Agent Construction
Upon invocation, the runtime assembles a dedicated Agent instance. This instance is pre-configured with a behavioral specification, which defines its reasoning boundaries and output policies, and a suite of specialized toolsets. By using a lightweight, high-performance foundation model for the reasoning loop, the system can maintain low-latency “thinking” phases while streaming responses back to the user in real time.
2. The Multi-Modal Reasoning Loop
The agent operates within a decision-making loop that intelligently selects from a registry of service-based tools. This allows the agent to handle complex, non-linear queries by delegating tasks to two primary subsystems:
- Parallel Hybrid Search: To locate specific events within a massive video library, the agent triggers a dual-path search. It simultaneously queries structured metadata via vector databases and leverages multi-modal embeddings to synthesize visual, audio and transcript data in tandem. This ensures the agent captures context that simple keyword or visual-only searches would miss.
- Contextual Media Analysis: For deep-dive questions, the agent interfaces with an analytical engine that processes raw media chunks directly from secure storage. This allows for granular Q&A, such as identifying specific objects or sentiments within a scene, without needing to pre-index the entire video file.
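The dual-path search described above can be sketched with `asyncio.gather`, which runs both lookups concurrently and collects their results. This is an illustrative sketch only: `metadata_search` and `embedding_search` are hypothetical stubs standing in for the real backends, and the choice to degrade to an empty list when one path fails is an assumption, not documented behavior of the accelerator.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

SearchFn = Callable[[str], Awaitable[List[str]]]


async def hybrid_search(query: str, paths: Dict[str, SearchFn]) -> Dict[str, List[str]]:
    """Run every search path concurrently and return results keyed by path name."""
    names = list(paths)
    results = await asyncio.gather(
        *(paths[name](query) for name in names),
        return_exceptions=True,  # a failing path must not cancel the others
    )
    merged: Dict[str, List[str]] = {}
    for name, result in zip(names, results):
        # Assumed policy: a failed path contributes no results.
        merged[name] = [] if isinstance(result, BaseException) else result
    return merged


async def metadata_search(query: str) -> List[str]:
    """Hypothetical stub for the structured-metadata path."""
    await asyncio.sleep(0.01)
    return ["clip-7", "clip-2"]


async def embedding_search(query: str) -> List[str]:
    """Hypothetical stub for the multi-modal embedding path."""
    await asyncio.sleep(0.01)
    return ["clip-2", "clip-9"]
```

Calling `asyncio.run(hybrid_search("dunk", {"metadata": metadata_search, "embeddings": embedding_search}))` returns one ranked list per path, which a fusion step can then merge.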
To provide a responsive user experience, the agent utilizes asynchronous streaming. Textual “thought” chunks are pushed to the client immediately as they are generated. Critically, inline prompt constants are used to govern internal logic, such as merging results and performing metadata reduction.
The Agentic AI-Assistant for Composing Viral Reels

Taking advantage of Bedrock AgentCore’s native features, such as AgentCore Tools, AgentCore Memory (conversation history) and AgentCore streaming responses, Wizeline delivers a unified, agentic AI-assistant experience within the UI that helps creators find the best moments for their reels, with timecode accuracy, using conversational prompts.
The AI assistant employs a swarm of agents that understand a user’s intentions and translate them into the most effective search queries, from vector embedding-based approaches to semantic search to hybrid search and metadata filtering, allowing users to quickly find the best moments regardless of where videos are stored in a fragmented media landscape. To provide short-term conversational continuity, we integrated the Bedrock AgentCore Memory (MemoryClient), which persists multi-turn context for up to 90 days.
The “Frame” Logic: From Semantic Intent to Pixel Precision
While “finding” the moment is a linguistic challenge, “framing” it for vertical social platforms is a spatial and visual one. To bridge this, the accelerator moves beyond simple center-cropping, which often misses the action, to a sophisticated Object-Aware Re-composition engine.
Wizeline implemented a workflow that uses OpenCV, Meta’s Segment Anything Models (SAM) and Ultralytics YOLO as its core geometry engine. While multimodal LLMs understand what is happening, the geometry and computer vision engine determines where it is happening across time in order to recommend context-aware, smart cropping and reframing. NVIDIA CV-CUDA running on EC2 instances powers the auto-framing, object detection, and smooth subject tracking for vertical video generation.
For editors and producers, it was also critical to build interoperability into the suite, providing the flexibility to do editing, clipping and post-production in other NLEs or asset and content management systems. Because tracking coordinates are output with keyframe and timecode metadata in standard JSON format, editing workflows can continue natively in professional editing suites like Adobe Premiere, Avid Media Composer or DaVinci Resolve, or be processed instantly in the cloud via AWS Lambda for real-time social delivery.
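To make the geometry concrete, the sketch below turns a tracked subject's horizontal center into a full-height 9:16 crop window, smooths the track to reduce jitter, and emits keyframes as JSON. It is a simplified illustration under stated assumptions: the detector output is reduced to one x-coordinate per frame, the exponential smoothing and its `alpha` value are stand-ins for whatever tracking filter the suite uses, and the JSON field names are hypothetical rather than the suite's actual export schema.

```python
import json
from typing import Dict, List, Tuple


def vertical_crop(frame_w: int, frame_h: int, subject_cx: float,
                  aspect: Tuple[int, int] = (9, 16)) -> Dict[str, int]:
    """Full-height crop of the given aspect, centered on the subject, clamped to the frame."""
    # Round the width down to an even number, which 4:2:0 encoders typically require.
    crop_w = min(frame_w, int(frame_h * aspect[0] / aspect[1]) // 2 * 2)
    x = round(subject_cx - crop_w / 2)
    x = max(0, min(x, frame_w - crop_w))  # keep the window inside the frame
    return {"x": x, "y": 0, "w": crop_w, "h": frame_h}


def smooth(centers: List[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average so the crop window does not jitter frame to frame."""
    out: List[float] = []
    for c in centers:
        out.append(c if not out else alpha * c + (1 - alpha) * out[-1])
    return out


def to_keyframes(frame_w: int, frame_h: int, fps: float, centers: List[float]) -> str:
    """Serialize one crop keyframe per frame, with frame index and time in seconds."""
    keyframes = [
        {"frame": i, "time_sec": round(i / fps, 3),
         "crop": vertical_crop(frame_w, frame_h, cx)}
        for i, cx in enumerate(smooth(centers))
    ]
    return json.dumps(
        {"source": {"w": frame_w, "h": frame_h, "fps": fps}, "keyframes": keyframes},
        indent=2,
    )
```

For a 1920x1080 source, the crop window is 606 pixels wide; a subject near either edge of the frame pins the window to that edge instead of letting it run past the picture.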

Publishing to the “Feed”: Tips to Maximize Engagement & Monetization
Once the reel has been assembled, AWS Elemental MediaConvert and FFmpeg work together to output the MP4 asset in 9:16, 16:9, and 1:1 formats at resolutions up to 4K. Using the AI plug-in-generated metadata from ingest, Pegasus analysis of the newly created short-form video asset, and a text-to-text LLM (Claude Haiku, in this case), the suite’s agentic workflow quickly crafts titles, captions, and descriptions for each asset while applying standardized tags aligned to IAB and GARM taxonomies for discoverability and ad relevance. Hyper-personalized content features within the accelerator take it a step further by using predefined target persona prompts and style guides to recommend customized captions and posts tailored to specific audience segments, so a single master asset can be repurposed into multiple platform-ready pieces in minutes for hyper-targeted campaigns.
When it’s time to distribute, creators can publish directly to YouTube, Facebook, Instagram, TikTok, and X using automation tools like Zapier, or export clips as MP4 bundles with SRT and WebVTT subtitle files. For organizations with existing production infrastructure, the suite integrates into CMS, MAM, and DAM systems via its open API architecture and supports EDL exports in CMX 3600 format for broadcast editing workflows. The result: live and archival content reaches audiences faster, across more channels, with less manual effort.
Infrastructure-as-Code Deployed in your Enterprise Architecture
Our suite of AI media accelerators is built on a modern, open, microservices-based architecture that can be deployed as Infrastructure-as-Code (IaC) through Terraform templates or CDK into a customer’s cloud or on-premises data centers. Customers can take advantage of foundation models like TwelveLabs Pegasus and Marengo through Amazon Bedrock or within containerized workflows, while retaining the freedom to reconfigure and customize the solution for their particular media workflows.
Pro-Tips for Implementation
- Context over Keywords: Combine semantic multi-vector search with metadata filters (speaker ID, date) for the ultimate hybrid experience.
- Preserve Modality Separation: For debugging, always ensure you can isolate which signal (visual vs. transcript) drove a specific result.
Are you prepared to transform your media supply chain?
Wizeline provides specialized professional services to assist clients in developing bespoke media workflows, integrating premier third-party technologies, and driving superior business results. By leveraging our AI-native frameworks and accelerators, we enable organizations to adopt more efficient, modern ways of storytelling.
Get in touch with Wizeline today to discover how our agentic accelerators can unlock the untapped value within your archives, turn live moments into viral ones, and empower you with complete control over your content’s destiny.
