Tempo is a framework designed to solve one of the hardest problems in multimodal AI: understanding very long videos without blowing up your context window or compute budget.
The Core Problem It Solves
Videos are brutally expensive for transformers. A 1-hour video at even 1 frame per second gives you 3,600 frames. At typical vision encoding resolutions, each frame becomes hundreds of tokens, so the full video comes to hundreds of thousands or even millions of tokens, more than most current models can hold in a single context window. Even where the window is large enough, the attention computation is prohibitively expensive, because self-attention cost grows as O(n²) in sequence length.
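To make the arithmetic concrete, here is a small back-of-the-envelope sketch. The per-frame token counts (256 and 576) are illustrative assumptions, not values taken from Tempo, and the function names are made up for this example.

```python
def video_token_count(duration_s: int, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens produced by sampling `duration_s` seconds at `fps`."""
    n_frames = int(duration_s * fps)
    return n_frames * tokens_per_frame


def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: the n^2 term."""
    return n_tokens * n_tokens


# Illustrative per-frame token counts (assumptions, not Tempo's settings).
for tokens_per_frame in (256, 576):
    n = video_token_count(duration_s=3600, fps=1, tokens_per_frame=tokens_per_frame)
    print(f"{tokens_per_frame} tokens/frame -> {n:,} tokens, "
          f"{attention_pairs(n):.2e} attention pairs per layer per head")
```

Because the pair count is quadratic, halving the number of visual tokens cuts the attention work to a quarter, which is why compressing the video before the large model sees it pays off so strongly.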
Prior approaches mostly resorted to uniform frame sampling — just pick every Nth frame and hope you don’t miss anything important. This works poorly for real-world videos where interesting events are sparse and unevenly distributed.
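A toy sketch of why uniform sampling misses sparse events. The stride and the event timing are hypothetical numbers chosen for illustration only.

```python
def uniform_sample(n_frames: int, stride: int) -> list[int]:
    """Indices of every `stride`-th frame, starting at frame 0."""
    return list(range(0, n_frames, stride))


def covers_event(sampled: list[int], start: int, end: int) -> bool:
    """True if any sampled frame index falls in the half-open range [start, end)."""
    return any(start <= i < end for i in sampled)


# One hour at 1 fps, keeping one frame per minute.
sampled = uniform_sample(n_frames=3600, stride=60)

# A hypothetical 15-second event starting 20 minutes and 10 seconds in.
event_start, event_end = 1210, 1225

print(len(sampled))                                   # 60
print(covers_event(sampled, event_start, event_end))  # False
```

The nearest sampled frames are 1200 and 1260, so the event falls entirely between them. Any event shorter than the stride can be skipped this way, regardless of how important it is to the question.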
Tempo’s Approach
Tempo introduces a query-aware temporal compressor built around a Small Vision-Language Model (SVLM). The key word is query-aware — rather than compressing the video uniformly, Tempo compresses it differently depending on what question you’re actually asking.
Pipeline
- Frame encoding — video frames are encoded into visual embeddings as usual.
- Query-aware compression — the SVLM takes both the visual embeddings and the user’s query, then identifies which temporal segments are relevant to that specific question. Relevant frames/segments are preserved at higher fidelity; unrelated stretches get aggressively compressed or dropped.
- Compressed representation passed to the main MLLM — the large multimodal model receives a much shorter, query-focused token sequence rather than the raw full video, and generates the answer.
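The description above does not say how the SVLM turns relevance into a compression ratio, so the following is only a minimal sketch of the general idea under stated assumptions, not Tempo's actual mechanism. It assumes each temporal segment and the query have already been embedded as plain vectors, scores relevance with cosine similarity, and splits a fixed token budget across segments with a softmax. The names `allocate_budget`, `total_budget`, and `temperature` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def allocate_budget(segment_embs, query_emb, total_budget, temperature=0.1):
    """Split `total_budget` tokens across segments in proportion to relevance.

    Assumption: relevance is cosine similarity to the query, sharpened by a
    softmax. A segment that ends up with 0 tokens is effectively dropped.
    """
    if not segment_embs:
        return []
    scores = [cosine(e, query_emb) for e in segment_embs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    z = sum(weights)
    raw = [total_budget * w / z for w in weights]
    budgets = [int(r) for r in raw]  # floor each share
    # Hand leftover tokens to the segments with the largest fractional parts.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Three toy segments: on-topic, off-topic, and mostly on-topic.
segments = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
budgets = allocate_budget(segments, query, total_budget=100)
print(budgets)
```

In this toy run the off-topic segment receives essentially no tokens while the two on-topic segments share the budget. Lowering `temperature` concentrates the budget on the best-matching segments; raising it moves the allocation back toward uniform compression.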
Why the SVLM Approach Is Smart
Using a small model for compression is elegant:
- The SVLM is cheap to run — its job isn’t to answer the question, just to identify relevance.
- It acts as a smart pre-filter, doing temporal attention at a coarse level so the expensive large model only reasons over the parts that matter.
- The design is architecturally similar to retrieval-augmented generation (RAG): a cheap retriever narrows down the context before an expensive generator does the heavy lifting. Here the same idea is applied to the time dimension of a video rather than to a document corpus.
What It Enables
- Hour-long video Q&A — e.g., “what was the presenter doing when they mentioned the revenue figures?” over a full lecture or meeting recording
- Video summarization with focus — “summarize only the parts relevant to the product demo”
- Temporal grounding — “when does X happen?” over long content
- Surveillance and monitoring — finding specific events in hours of footage without manual scrubbing
Limitations and Open Questions
- Compression quality depends on the SVLM’s ability to judge relevance, and that judgment is made before the large model sees any of the video. Whatever the SVLM discards is not available to the large model afterwards.
- If the query is ambiguous, the SVLM might compress away important material.
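- Queries that require understanding patterns across time (e.g., “how does the speaker’s tone change over the talk?”) are harder than point-in-time retrieval, because the relevant evidence is spread across the whole video rather than concentrated in a few segments.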
Broader Significance
Tempo is part of a broader wave of research on hierarchical / cascaded multimodal processing — the principle that raw, unfiltered perceptual data should not go directly to your most expensive model. Use cheap, fast models for coarse filtering and structuring, then hand off a condensed, task-relevant representation to the powerful model. This pattern is likely to become standard practice as video becomes a primary modality in AI applications.