timescope #2961


Open · wants to merge 10 commits into main

Conversation

@orrzohar

Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

- [x] Add an entry to `_blog.yml`.
- [x] Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
- [x] Check that you use a short title and blog path.
- [x] Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to keep images small to avoid a slow or expensive user experience.
- [x] Add metadata (such as authors) to your md file. You can also specify `guest` or `org` for the authors.
- [x] Ensure the publication date is correct.
- [x] Preview the content. A quick way is to paste the markdown content into https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.

Here is an example of a complete PR: #2382

Getting a Review

Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.

Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.

@orrzohar (Author)

Note: the HF Space embed renders in my local markdown preview, but not on https://huggingface.co/new-blog.

Is that normal? @andimarafioti

@merveenoyan (Contributor) left a comment:


did an initial pass, super nice!

@@ -6339,3 +6339,15 @@
- robotics
- models
- open-source

- local: timescope
title: "TimeScope: How Long Can Your Video-LMM go?"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never seen LMM abbreviation tbh it might be good to write long, it also helps with SEO

  org: Stanford
- user: andito
  guest: false
  org: HuggingFace

Contributor suggested change:
- org: HuggingFace
+ org: huggingface

@@ -0,0 +1,96 @@
---
title: "TimeScope: How Long Can Your Video-LMM go?"

Contributor: It would be nice to keep the filename long for SEO.


# TimeScope: How Long Can Your Video-LMM go?

As multimodal AI continues to advance, recent models have begun to make ambitious claims regarding their capacity for “hour-long video understanding.” Drawing inspiration from progress in long-context language models—where extended reasoning over lengthy textual inputs has become increasingly feasible—vision-language systems now advertise context windows encompassing thousands of frames. However, these developments prompt a critical inquiry: to what extent do such models demonstrate genuine temporal comprehension, as opposed to surface-level pattern recognition or overstated capabilities?

Contributor: The sentences are a bit too long and sophisticated; it would be great to simplify them. People will quit reading otherwise.




Text benchmarks such as **HELM** and **RULER** have exposed the fragility of long-context claims, showing that models often falter when tasks demand more than simple retrieval, like reasoning or aggregation at long context lengths. In the video domain, however, we're still playing catch-up. The most common test, **Video Needle in a Haystack (VideoNIAH)**, injects static *images* as "needles" into videos, effectively measuring visual search rather than true temporal dynamics. As a result, even top-tier models advertising massive frame capacities are rarely trained beyond ~256 frames and see sharp drops on benchmarks like **Video-MME** when pushed further.

Contributor: Might be nice to link them to their HF datasets.



This measurement gap leaves us wondering: What does it really mean for a model to "understand" long videos? To address this, we're excited to introduce **TimeScope**, a new open-source benchmark hosted on Hugging Face. TimeScope probes the limits of long-video capabilities by inserting short (~5-10 second) *video clips*—our "needles"—into base videos ranging from 1 minute to 8 hours. With three distinct task types, it evaluates not just retrieval but synthesis, localization, and fine-grained motion analysis, providing a more holistic view of temporal comprehension.

Contributor: very nice!
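
To make the setup above concrete, here is a minimal sketch of how a needle-in-a-haystack video sample could be parameterized. The class, helper, and field names are hypothetical, and the depth sweep is an assumption carried over from the text NIAH setup rather than a description of the actual TimeScope pipeline:

```python
from dataclasses import dataclass


@dataclass
class NeedleSample:
    """Illustrative container for one needle-in-a-haystack video sample.

    Field names are hypothetical; this is not the TimeScope schema.
    """
    base_duration_s: float    # length of the long "haystack" video (1 min to 8 h)
    needle_duration_s: float  # length of the inserted clip (~5-10 s)
    insert_at_s: float        # where the needle starts inside the base video
    question: str             # question the model must answer about the needle
    answer: str               # reference answer used for scoring

    @property
    def depth(self) -> float:
        """Relative needle position in the haystack (0.0 = start, 1.0 = end)."""
        return self.insert_at_s / max(self.base_duration_s - self.needle_duration_s, 1e-9)


def make_grid(durations_s, depths, needle_duration_s=8.0):
    """Sweep haystack durations and (assumed) insertion depths."""
    return [
        NeedleSample(
            base_duration_s=dur,
            needle_duration_s=needle_duration_s,
            insert_at_s=d * (dur - needle_duration_s),
            question="What happens in the inserted clip?",
            answer="<reference answer>",
        )
        for dur in durations_s
        for d in depths
    ]


if __name__ == "__main__":
    grid = make_grid(durations_s=[60, 600, 3600, 8 * 3600],
                     depths=[0.0, 0.25, 0.5, 0.75, 1.0])
    print(f"{len(grid)} samples; first depth = {grid[0].depth:.2f}")
```

Sweeping haystack length this way is what lets a benchmark chart where accuracy falls off, which is where the performance cliffs discussed in the results below show up.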

---

# TimeScope: How Long Can Your Video-LMM go?

Contributor: It might be nice to add a TL;DR and a table of contents at the beginning.


### 2. Information Synthesis (OCR QA)
Here, we embed multiple text-based needles (e.g., 2-4 short clips displaying "secret words" via on-screen text) at different points in the video. The model must identify all words and report them in chronological order, simulating tasks like extracting timestamps or key facts from dispersed scenes. This requires scanning the full timeline and understanding relative positioning.

Contributor: Might be nice to add an example again.
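
For instance, here is a toy sketch of what one OCR QA item could look like; the field names, prompt wording, and scoring are illustrative assumptions, not the dataset's actual schema:

```python
# Toy OCR QA item: several short "secret word" clips are spliced into a long base
# video, and the model must report the words in the order they appear.
# All field names and the prompt template below are hypothetical.
ocr_qa_item = {
    "base_video": "base_video_2h.mp4",  # the long haystack video
    "needles": [                        # text needles at different timestamps
        {"insert_at_s": 180,  "secret_word": "harbor"},
        {"insert_at_s": 2400, "secret_word": "lantern"},
        {"insert_at_s": 6100, "secret_word": "orchid"},
    ],
    "prompt": (
        "Several secret words are shown as on-screen text at different points in "
        "the video. List every secret word in the order it appears."
    ),
    "reference_answer": ["harbor", "lantern", "orchid"],
}


def exact_order_score(prediction: list[str], reference: list[str]) -> float:
    """Toy metric: 1.0 only if every word is recovered in the right order."""
    return float([w.lower() for w in prediction] == [w.lower() for w in reference])


print(exact_order_score(["Harbor", "lantern", "orchid"],
                        ocr_qa_item["reference_answer"]))  # prints 1.0
```

Exact-order matching is only one possible scoring choice; the point is that partial retrieval is not enough, since the model also has to track where in the timeline each word appeared.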


## Baseline Evaluation Results

To kick things off, we ran TimeScope on a suite of leading vision-language models, from open-source favorites to the juggernaut: Gemini2.5-pro. The results underscore the benchmark’s value: even models with advertised long-context prowess struggle with authentic temporal tasks at scale. These findings reveal clear patterns—performance cliffs around certain durations, strengths in static retrieval versus weaknesses in motion analysis—and pave the way for targeted improvements in model training. For detailed results and visualizations, check out our Hugging Face Space embedded above.

Contributor suggested change:
- ...from open-source favorites to the juggernaut: Gemini2.5-pro...
+ ...from open-source favorites to the juggernauts like Gemini2.5-Pro...


- **Dataset**: [Apollo-LMMs/TimeScope](https://huggingface.co/datasets/Apollo-LMMs/TimeScope)
- **Leaderboard**: [Apollo-LMMs/TimeScope](https://huggingface.co/spaces/Apollo-LMMs/TimeScope)
- **Evaluation Framework**: [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)

Contributor: It would be nice to end with a call to action below.
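
For readers who want to poke at the data right away, here is a minimal sketch of loading the benchmark from the Hub with the `datasets` library; the split name is an assumption, so check the dataset card for the exact configurations and splits:

```python
# Minimal sketch: pull the TimeScope data from the Hub and inspect a sample.
# The repo id comes from the links above; the split name is a guess, so check
# https://huggingface.co/datasets/Apollo-LMMs/TimeScope for the actual configs/splits.
from datasets import load_dataset

ds = load_dataset("Apollo-LMMs/TimeScope", split="test")  # split name is an assumption
print(ds)            # features and number of rows
print(ds[0].keys())  # fields of a single sample
```

Evaluation itself runs through the lmms-eval framework linked above.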
