timescope #2961
Conversation
note: the hf space renders in local md, but not on https://huggingface.co/new-blog. Is that normal? @andimarafioti
did an initial pass, super nice!
@@ -6339,3 +6339,15 @@
    - robotics
    - models
    - open-source

- local: timescope
  title: "TimeScope: How Long Can Your Video-LMM go?"
I have never seen the LMM abbreviation tbh, it might be good to write it out in full; it also helps with SEO
  org: Stanford
- user: andito
  guest: false
  org: HuggingFace
Suggested change:
-   org: HuggingFace
+   org: huggingface
@@ -0,0 +1,96 @@
---
title: "TimeScope: How Long Can Your Video-LMM go?"
it would be nice to keep the filename long for SEO
# TimeScope: How Long Can Your Video-LMM go?

As multimodal AI continues to advance, recent models have begun to make ambitious claims regarding their capacity for “hour-long video understanding.” Drawing inspiration from progress in long-context language models—where extended reasoning over lengthy textual inputs has become increasingly feasible—vision-language systems now advertise context windows encompassing thousands of frames. However, these developments prompt a critical inquiry: to what extent do such models demonstrate genuine temporal comprehension, as opposed to surface-level pattern recognition or overstated capabilities?
the sentences are a bit too long and sophisticated, would be great to simplify them. people will quit reading otherwise
Text benchmarks such as **HELM** and **RULER** have exposed the fragility of long-context claims, showing that models often falter when tasks demand more than simple retrieval, like reasoning or aggregation at long context lengths. In the video domain, however, we're still playing catch-up. The most common test, **Video Needle in a Haystack (VideoNIAH)**, injects static *images* as "needles" into videos, effectively measuring visual search rather than true temporal dynamics. As a result, even top-tier models advertising massive frame capacities are rarely trained beyond ~256 frames and see sharp drops on benchmarks like **Video-MME** when pushed further.
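For intuition, here is a minimal sketch of what an image-needle setup boils down to. This is not VideoNIAH's actual code; the function name and the frames-as-arrays representation are illustrative assumptions. Because the needle is a single static frame, answering the probe question only requires per-frame visual search, not temporal reasoning:

```python
import random
import numpy as np

def insert_image_needle(base_frames: list[np.ndarray],
                        needle_image: np.ndarray) -> tuple[list[np.ndarray], int]:
    """Drop one static image into a sequence of video frames.

    Locating it afterwards is a visual-search problem over individual
    frames; no motion or ordering information is needed.
    """
    position = random.randrange(len(base_frames) + 1)
    haystack = base_frames[:position] + [needle_image] + base_frames[position:]
    return haystack, position
```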
might be nice to link them to HF datasets
This measurement gap leaves us wondering: What does it really mean for a model to "understand" long videos? To address this, we're excited to introduce **TimeScope**, a new open-source benchmark hosted on Hugging Face. TimeScope probes the limits of long-video capabilities by inserting short (~5-10 second) *video clips*—our "needles"—into base videos ranging from 1 minute to 8 hours. With three distinct task types, it evaluates not just retrieval but synthesis, localization, and fine-grained motion analysis, providing a more holistic view of temporal comprehension.
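As a rough sketch of the construction described above (not the benchmark's actual generation code; the function name, the `depth` parameter, and the frames-as-arrays representation are assumptions), the key difference from the static-image setup is that the needle is a contiguous clip whose position along the timeline is controlled, so accuracy can be reported per haystack duration and needle depth:

```python
import numpy as np

def insert_clip_needle(base_frames: list[np.ndarray],
                       needle_clip: list[np.ndarray],
                       depth: float) -> list[np.ndarray]:
    """Splice a short, contiguous clip into a long base video.

    `depth` in [0, 1] controls where the needle lands along the timeline.
    Questions about the needle's content require watching a moving clip,
    not just spotting one frame.
    """
    assert 0.0 <= depth <= 1.0
    position = int(depth * len(base_frames))
    return base_frames[:position] + needle_clip + base_frames[position:]
```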
very nice!
---

# TimeScope: How Long Can Your Video-LMM go?
might be nice to add a TL;DR and a table of contents at the beginning
</video>

### 2. Information Synthesis (OCR QA)
Here, we embed multiple text-based needles (e.g., 2-4 short clips displaying "secret words" via on-screen text) at different points in the video. The model must identify all words and report them in chronological order, simulating tasks like extracting timestamps or key facts from dispersed scenes. This requires scanning the full timeline and understanding relative positioning.
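To make the evaluation target concrete, here is an illustrative scoring sketch; the benchmark's exact metric isn't given in this excerpt, so treat the function and its outputs as assumptions. Credit is split between recovering every secret word and preserving their chronological order:

```python
def score_ocr_qa(predicted: list[str], ground_truth: list[str]) -> dict[str, bool]:
    """Check whether the model found all secret words and kept the order
    in which their needles appear in the video."""
    pred = [w.strip().lower() for w in predicted]
    gold = [w.strip().lower() for w in ground_truth]
    return {
        "all_words_found": set(gold).issubset(set(pred)),
        "correct_order": pred == gold,
    }

# Example: the words were found but reported out of order.
print(score_ocr_qa(["delta", "apple"], ["apple", "delta"]))
# {'all_words_found': True, 'correct_order': False}
```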
might be nice to add example again
## Baseline Evaluation Results

To kick things off, we ran TimeScope on a suite of leading vision-language models, from open-source favorites to the juggernaut: Gemini2.5-pro. The results underscore the benchmark’s value: even models with advertised long-context prowess struggle with authentic temporal tasks at scale. These findings reveal clear patterns—performance cliffs around certain durations, strengths in static retrieval versus weaknesses in motion analysis—and pave the way for targeted improvements in model training. For detailed results and visualizations, check out our Hugging Face Space embedded above.
Suggested change (the rest of the paragraph is unchanged):
- … from open-source favorites to the juggernaut: Gemini2.5-pro. …
+ … from open-source favorites to the juggernauts like Gemini2.5-Pro. …
- **Dataset**: [Apollo-LMMs/TimeScope](https://huggingface.co/datasets/Apollo-LMMs/TimeScope)
- **Leaderboard**: [Apollo-LMMs/TimeScope](https://huggingface.co/spaces/Apollo-LMMs/TimeScope)
- **Evaluation Framework**: [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
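If you want to inspect the data directly, here is a minimal sketch using the `datasets` library. The repository id comes from the links above; the available configurations and splits aren't listed in this excerpt, so check the dataset page and adjust as needed.

```python
from datasets import load_dataset

# Repository id taken from the resource list above. Whether a specific
# configuration name is required is an assumption to verify on the
# dataset page.
ds = load_dataset("Apollo-LMMs/TimeScope")

print(ds)                      # shows the available splits and their sizes
first_split = next(iter(ds.values()))
print(first_split[0])          # inspect the fields of one sample
```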
would be nice to end with a call to action below
Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
… `md` file. You can also specify `guest` or `org` for the authors. Here is an example of a complete PR: #2382
Getting a Review
Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.
Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.