timescope #2961

Open · wants to merge 10 commits into base: main
12 changes: 12 additions & 0 deletions _blog.yml
@@ -6339,3 +6339,15 @@
- robotics
- models
- open-source

- local: timescope
  title: "TimeScope: How Long Can Your Video-LMM Go?"
> **Reviewer (Contributor):** I have never seen the LMM abbreviation, tbh; it might be good to write it out in full, which also helps with SEO.

author: orrzohar
thumbnail: /blog/assets/timescope/thumbnail.png
date: Jul 14, 2025
tags:
- video
- datasets
- multimodal
- open-source
- benchmark
Binary file added assets/timescope/thumbnail.png
96 changes: 96 additions & 0 deletions timescope.md
@@ -0,0 +1,96 @@
---
title: "TimeScope: How Long Can Your Video-LMM Go?"
> **Reviewer (Contributor):** It would be nice to keep the filename long for SEO.

thumbnail: /blog/assets/timescope/thumbnail.png
authors:
- user: orrzohar
guest: true
org: Stanford
- user: ruili0
guest: true
org: Stanford
- user: andito
guest: false
org: HuggingFace
> **Reviewer (Contributor), suggested change:** `org: HuggingFace` → `org: huggingface`

- user: nicholswang
guest: true
org: Stanford
---

# TimeScope: How Long Can Your Video-LMM Go?

> **Reviewer (Contributor):** Might be nice to add a TL;DR and a table of contents at the beginning.

As multimodal AI advances, recent models have begun to make ambitious claims about "hour-long video understanding." Inspired by progress in long-context language models, where reasoning over lengthy text inputs has become increasingly feasible, vision-language systems now advertise context windows spanning thousands of frames. These claims raise a critical question: do such models genuinely comprehend what unfolds over time, or are they leaning on surface-level pattern recognition and overstated capabilities?
> **Reviewer (Contributor):** The sentences are a bit too long and sophisticated; it would be great to simplify them. People will quit reading otherwise.

Text benchmarks such as **HELM** and **RULER** have exposed the fragility of long-context claims, showing that models often falter when tasks demand more than simple retrieval, like reasoning or aggregation at long context lengths. In the video domain, however, we're still playing catch-up. The most common test, **Video Needle in a Haystack (VideoNIAH)**, injects static *images* as "needles" into videos, effectively measuring visual search rather than true temporal dynamics. As a result, even top-tier models advertising massive frame capacities are rarely trained beyond ~256 frames and see sharp drops on benchmarks like **Video-MME** when pushed further.
> **Reviewer (Contributor):** Might be nice to link them to HF datasets.

This measurement gap leaves us wondering: What does it really mean for a model to "understand" long videos? To address this, we're excited to introduce **TimeScope**, a new open-source benchmark hosted on Hugging Face. TimeScope probes the limits of long-video capabilities by inserting short (~5-10 second) *video clips*—our "needles"—into base videos ranging from 1 minute to 8 hours. With three distinct task types, it evaluates not just retrieval but synthesis, localization, and fine-grained motion analysis, providing a more holistic view of temporal comprehension.
> **Reviewer (Contributor):** Very nice!

<script type="module" src="https://gradio.s3-us-west-2.amazonaws.com/4.4.0/gradio.js"></script>
<gradio-app theme_mode="dark" space="Apollo-LMMs/TimeScope"></gradio-app>

## Why TimeScope? Motivating a Better Benchmark for Video

The promise of long-video AI is transformative — enabling agents to summarize hours of footage, detect subtle anomalies, and answer complex questions about extended narratives. Integrated into robotics, these models could analyze prolonged operations, adapt in real time, and drive autonomous decision-making. Just as powerful is the vision of a personal assistant that understands daily life and offers continuous, actionable feedback.



In practice, however, capabilities are often overstated. Models might claim to process 10,000+ frames, but their training data frequently caps at around 256 frames per clip, leading to degraded performance on longer inputs. We've seen this in evaluations where increasing the number of sampled frames tanks accuracy on tasks that require temporal insight.

TimeScope flips the script by emphasizing three pillars of long-video understanding:
1. **Localized Retrieval**: Can the model spot and answer questions about a specific short segment within a vast video?
2. **Information Synthesis**: Can it gather and order details from multiple points across the timeline?
3. **Temporal Perception**: Can it analyze motion and events in needles that demand dense, multi-frame sampling?


## Benchmark Design

TimeScope's core innovation is its use of video clips as needles: answering correctly requires more than simply sampling a frame that happens to land in the needle; the model must densely understand the entire video. We start with a long base video (e.g., a documentary, lecture, or ambient footage) and insert one or more hand-curated short video needles (5-10 seconds each) at random positions. These needles contain the key information needed to solve the task, forcing models to process the entire input without shortcuts like sparse sampling.


<img src="https://huggingface.co/spaces/Apollo-LMMs/TimeScope/resolve/main/overview.png" alt="Benchmark Design Diagram" style="width: 90%; height: auto;">


*Figure 1: Overview of TimeScope's needle insertion process. A long base video (1 min to 8 hours) serves as the haystack, into which we splice short video needles (~10-20 seconds). Tasks require detecting, synthesizing, or analyzing content from these needles, embedded at varying depths.*
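
To make the construction concrete, here is a minimal sketch of the insertion step at the frame level. It is illustrative only: the released pipeline works on encoded video, and the names (`insert_needles`, `base_frames`, `needles`) are placeholders, not TimeScope's actual code.

```python
import random

def insert_needles(base_frames, needles, seed=0):
    """Splice short needle clips into a long base video at random positions.

    base_frames: sequence of decoded frames for the haystack video.
    needles: list of frame sequences, one per short needle clip.
    Returns the combined frame list plus each needle's insertion offset
    (an index into the original base video), so the ground-truth answer
    can be tied to where the needle ended up.
    """
    rng = random.Random(seed)
    # One random insertion point per needle, spliced back-to-front so that
    # earlier indices into the base video stay valid while we insert.
    positions = sorted((rng.randrange(len(base_frames) + 1) for _ in needles), reverse=True)
    frames = list(base_frames)
    placements = []
    for needle, pos in zip(needles, positions):
        frames[pos:pos] = list(needle)   # insert the whole clip at this depth
        placements.append((pos, len(needle)))
    return frames, placements
```

In the benchmark itself, the haystack length is varied from about one minute up to eight hours and the needle depth is swept across the timeline, which is what lets performance be charted against both duration and position.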

We evaluate across three needle types, each targeting a different aspect of long-video comprehension:

### 1. General QA
This tests basic retrieval and understanding of a localized event. Questions are crafted such that sampling a single relevant frame from the needle should suffice—mimicking queries about a brief segment in a longer video.

Example:
What mode of transportation is shown in the video?

<video controls>
<source src="https://huggingface.co/spaces/Apollo-LMMs/TimeScope/resolve/main/train.mp4" type="video/mp4">
</video>

### 2. Information Synthesis (OCR QA)
Here, we embed multiple text-based needles (e.g., 2-4 short clips displaying "secret words" via on-screen text) at different points in the video. The model must identify all words and report them in chronological order, simulating tasks like extracting timestamps or key facts from dispersed scenes. This requires scanning the full timeline and understanding relative positioning.
> **Reviewer (Contributor):** Might be nice to add an example again.

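To make the synthesis task concrete, here is a minimal sketch of how an ordered-recall answer could be checked. The secret words and the exact-match scoring rule are illustrative assumptions, not the benchmark's official metric.

```python
def ordered_recall(predicted_words, ground_truth_words):
    """Score an Information Synthesis answer.

    Full credit only if the model reports every secret word in the same
    chronological order in which the needles appear in the video.
    (Illustrative metric only.)
    """
    pred = [w.strip().lower() for w in predicted_words]
    gold = [w.strip().lower() for w in ground_truth_words]
    return 1.0 if pred == gold else 0.0

# Hypothetical example: three text needles spliced at increasing depths.
print(ordered_recall(["falcon", "umbrella", "quartz"],
                     ["falcon", "umbrella", "quartz"]))  # 1.0
print(ordered_recall(["umbrella", "falcon", "quartz"],
                     ["falcon", "umbrella", "quartz"]))  # 0.0 (wrong order)
```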

### 3. Fine-Grained Temporal Understanding
For questions focusing on motion or sequences within a short clip, single-frame sampling won't cut it—the model needs to perceive dynamics across frames. This probes whether long-context handling preserves temporal fidelity.

Example:
How many times did the man swing his axe? (a) one (b) two (c) three (d) four (e) five (f) six

<video controls>
<source src="https://huggingface.co/spaces/Apollo-LMMs/TimeScope/resolve/main/temporal_wood_cutting.mp4" type="video/mp4">
</video>

By varying video lengths and needle placements, TimeScope quantifies the maximum duration a model can "reasonably" claim to understand, highlighting drop-offs in performance as contexts grow.
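
One way to read the benchmark is as a grid of accuracies over haystack duration and needle depth: a model's usable context is roughly the longest duration at which accuracy stays above some threshold. The sketch below assumes a hypothetical `evaluate(model, duration_min, depth)` callable that returns accuracy for one cell; the duration grid and threshold are illustrative, not part of TimeScope's released code.

```python
# Hypothetical sweep over haystack durations and needle depths.
DURATIONS_MIN = [1, 8, 32, 64, 128, 256, 480]   # haystack length in minutes (480 min = 8 h)
DEPTHS = [0.1, 0.3, 0.5, 0.7, 0.9]              # needle position as a fraction of the video

def usable_context(model, evaluate, threshold=0.8):
    """Return the longest duration (minutes) at which mean accuracy across depths
    stays above the threshold; 0 means the model never clears it."""
    longest = 0
    for duration in DURATIONS_MIN:
        scores = [evaluate(model, duration, depth) for depth in DEPTHS]
        if sum(scores) / len(scores) >= threshold:
            longest = duration
        else:
            break   # accuracy cliff: stop at the first duration below threshold
    return longest
```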

## Baseline Evaluation Results

To kick things off, we ran TimeScope on a suite of leading vision-language models, from open-source favorites to the juggernaut: Gemini2.5-pro. The results underscore the benchmark’s value: even models with advertised long-context prowess struggle with authentic temporal tasks at scale. These findings reveal clear patterns—performance cliffs around certain durations, strengths in static retrieval versus weaknesses in motion analysis—and pave the way for targeted improvements in model training. For detailed results and visualizations, check out our Hugging Face Space embedded above.
> **Reviewer (Contributor), suggested change:** "…from open-source favorites to the juggernaut: Gemini2.5-pro." → "…from open-source favorites to the juggernauts like Gemini2.5-Pro."

## Open-Sourcing

We are open-sourcing all components of TimeScope; a quick-start sketch follows the links below:

- **Dataset**: [Apollo-LMMs/TimeScope](https://huggingface.co/datasets/Apollo-LMMs/TimeScope)
- **Leaderboard**: [Apollo-LMMs/TimeScope](https://huggingface.co/spaces/Apollo-LMMs/TimeScope)
- **Evaluation Framework**: [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
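
If you want to poke at the data yourself, the snippet below shows one way to pull it with the `datasets` library; evaluation then runs through lmms-eval (see that repo's README for the exact command and supported models). The default-config assumption and field layout are not guaranteed; check the dataset card for the exact identifiers.

```python
# Minimal sketch: inspect the TimeScope dataset from the Hub.
from datasets import load_dataset

# Assumes the default config; pass a config/split name if the dataset defines several.
ds = load_dataset("Apollo-LMMs/TimeScope")
print(ds)  # prints the available splits and column names
```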
> **Reviewer (Contributor):** Would be nice to end with a call to action below.