Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in video length and task diversity, or they focus solely on end-to-end LVU performance, making them unsuitable for evaluating whether key moments can be accurately retrieved. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is built on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports videos. Second, it covers a variety of real-world scenarios at three levels (global-level, event-level, and object-level), spanning common tasks such as action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results show that long-video moment retrieval remains challenging in terms of both accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning.
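
For reference, the sketch below shows one way the two headline metrics, R@1 and mAP@5, can be computed once a method has produced a ranked list of candidate moments. The hit criterion (temporal IoU ≥ 0.5 against a ground-truth span) and the AP@k normalization are illustrative assumptions, not necessarily the benchmark's official scoring rule.

```python
# Minimal scoring sketch for moment retrieval, assuming a candidate counts as
# a hit when its temporal IoU with any ground-truth span exceeds a threshold.
# The threshold and the AP@k normalization are illustrative choices.
from typing import List, Tuple

Interval = Tuple[float, float]  # [start, end] in seconds

def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(ranked: List[Interval], gt: List[Interval], thr: float = 0.5) -> float:
    """1.0 if the top-ranked candidate overlaps a ground-truth moment, else 0.0."""
    return float(any(temporal_iou(ranked[0], g) >= thr for g in gt))

def ap_at_k(ranked: List[Interval], gt: List[Interval], k: int = 5, thr: float = 0.5) -> float:
    """One common AP@k variant: precision averaged over hit positions in the top k."""
    hits, precisions = 0, []
    for rank, cand in enumerate(ranked[:k], start=1):
        if any(temporal_iou(cand, g) >= thr for g in gt):
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(1, min(k, len(gt)))
```

mAP@5 is then the mean of `ap_at_k` over all queries.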
Examples of each task. Dashed boxes show the sources of the query image $q_I$ and query video $q_V$; solid boxes mark ground-truth moments. Red circles highlight the key queried information.
Dataset statistics: (a) question type distribution, (b) video duration distribution across samples, and (c) answer time-range length distribution across samples. MomentSeeker spans a full spectrum of video lengths and covers the core abilities required for moment retrieval.
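
To make the three query forms concrete, the snippet below sketches how a single MomentSeeker-style sample might be represented. All field names here (`query_text`, `query_image`, `query_video`, `moments`, `level`, `task`) are hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
# A hypothetical sample record illustrating the annotations described above;
# every field name is illustrative, not MomentSeeker's real schema.
sample = {
    "video": "movies/clip_0001.mp4",      # long source video (~20 minutes on average)
    "level": "event-level",               # global-level / event-level / object-level
    "task": "causal reasoning",           # e.g., action recognition, object localization
    "query_text": "Why does the driver suddenly brake?",
    "query_image": None,                  # path to q_I for image-conditioned queries
    "query_video": None,                  # path to q_V for video-conditioned queries
    "moments": [[412.0, 437.5]],          # ground-truth [start, end] spans in seconds
}
```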
Benchmark | Label | Moment-targeted? | Task-oriented? | Duration (s) | #Videos | #Queries | Domain |
---|---|---|---|---|---|---|---|
Moment retrieval benchmarks | | | | | | | |
TVR [1] | Auto | ✔ | ✘ | 76.2 | 1090 | 5450 | TV show |
CharadesSTA [2] | Human | ✔ | ✘ | 30.6 | 1334 | 3720 | Activity |
THUMOS14 [3] | Human | ✔ | ✘ | 186.4 | 216 | 3457 | Action |
QVHighlights [4] | Human | ✔ | ✔ | 150.0 | 476 | 1542 | Vlog/News |
LVU benchmarks | | | | | | | |
VideoMME [5] | Human | ✘ | ✔ | 1021.3 | 900 | 2700 | YouTube |
MLVU [6] | Human | ✘ | ✔ | 905.8 | 349 | 502 | Open |
LongVideoBench [7] | Human | ✘ | ✔ | 574.9 | 753 | 1337 | Open |
V-NIAH [8] | Auto | ✘ | ✔ | - | - | 5 | Open |
MomentSeeker | Human | ✔ | ✔ | 1201.9 | 268 | 1800 | Open |
Comparison of MomentSeeker with popular benchmarks for moment retrieval and long video understanding. We report statistics for the test set of each benchmark.
Main results across different meta-tasks. #Frames indicates the number of input frames for generation-based methods and per-clip frames for retrieval-based methods. † denotes tested on a random subset due to high cost.
Retrieval-based methods:

Method | #Size | #Frames | Global R@1 | Global mAP@5 | Event R@1 | Event mAP@5 | Object R@1 | Object mAP@5 | Overall R@1 | Overall mAP@5
---|---|---|---|---|---|---|---|---|---|---
InternVideo2 | 1B | 8 | 16.8 | 24.5 | 23.5 | 30.9 | 17.0 | 22.7 | 19.7 | 26.6 |
LanguageBind | 428M | 8 | 16.2 | 24.6 | 21.4 | 29.4 | 15.5 | 21.0 | 18.2 | 25.4 |
E5V | 8.4B | 1 | 13.1 | 19.5 | 14.5 | 20.7 | 14.9 | 19.8 | 14.3 | 20.1 |
MM-Ret | 148M | 1 | 14.2 | 17.9 | 13.6 | 19.4 | 9.7 | 15.4 | 12.4 | 17.7 |
CoVR | 588M | 15 | 9.8 | 15.4 | 13.7 | 19.9 | 14.4 | 18.9 | 13.0 | 18.5 |
UniIR | 428M | 1 | 14.9 | 19.4 | 11.5 | 17.9 | 8.2 | 13.9 | 11.2 | 16.9 |
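
The retrieval-based baselines above embed candidate clips and the query separately and rank clips by similarity. The sketch below shows one plausible pipeline (fixed-length clip segmentation, per-clip frame sampling, cosine-similarity ranking); `encode_frames`, `encode_text`, and `sample_frames` are stand-ins for an arbitrary video-text embedding model, and the clip length and frame count are illustrative rather than the exact setup behind these numbers.

```python
# Sketch of a dual-encoder retrieval baseline: segment the long video into
# fixed-length clips, sample a few frames per clip, embed clips and the query,
# and rank clips by cosine similarity. The encoder callables are placeholders
# for any video-text embedding model; clip_len and num_frames are illustrative.
import numpy as np

def segment(duration: float, clip_len: float = 10.0):
    """Return [start, end] clip boundaries covering the whole video."""
    starts = np.arange(0.0, duration, clip_len)
    return [(float(s), float(min(s + clip_len, duration))) for s in starts]

def rank_clips(video_path, duration, query, encode_frames, encode_text,
               sample_frames, num_frames: int = 8):
    clips = segment(duration)
    clip_embs = []
    for start, end in clips:
        frames = sample_frames(video_path, start, end, num_frames)  # uniform sampling
        clip_embs.append(encode_frames(frames))                     # unit-norm vector
    clip_embs = np.stack(clip_embs)
    q = encode_text(query)                                          # unit-norm vector
    scores = clip_embs @ q                                          # cosine similarity
    order = np.argsort(-scores)
    return [clips[i] for i in order], scores[order]
```

Image- and video-conditioned queries would additionally need the embeddings of $q_I$ or $q_V$ fused with the text embedding, which is where multimodal retrievers differ most from plain text-to-video models.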
Generation-based methods:

Method | #Size | #Frames | Global R@1 | Global mAP@5 | Event R@1 | Event mAP@5 | Object R@1 | Object mAP@5 | Overall R@1 | Overall mAP@5
---|---|---|---|---|---|---|---|---|---|---
Qwen2.5VL | 72B | 768 | 13.6 | 13.0 | 21.9 | 21.8 | 12.2 | 11.9 | 17.2 | 16.9 |
GPT-4o(2024-11-20)† | - | 128 | 12.5 | 12.5 | 20.8 | 20.8 | 12.1 | 12.5 | 16.4 | 16.5 |
InternVL3 | 38B | 96 | 11.1 | 10.5 | 20.8 | 21.2 | 11.3 | 11.5 | 15.8 | 16.0 |
Qwen2.5VL | 7B | 768 | 4.6 | 3.8 | 12.0 | 12.2 | 4.3 | 4.2 | 8.1 | 8.0 |
LLaVA-Video | 72B | 96 | 3.6 | 3.5 | 8.6 | 9.8 | 4.6 | 5.6 | 6.3 | 7.2 |
TimeChat | 7B | 96 | 2.6 | 2.6 | 6.7 | 6.7 | 4.4 | 4.4 | 5.9 | 5.9 |
Lita | 13B | 100 | 2.6 | 2.6 | 7.2 | 7.2 | 1.8 | 1.8 | 5.6 | 5.6 |
InternVL3 | 8B | 96 | 3.9 | 3.5 | 7.8 | 8.5 | 4.1 | 4.1 | 5.9 | 6.1 |
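
Generation-based methods instead ask an MLLM to localize the moment directly from sampled frames. A minimal sketch of such a protocol is given below: prompt the model with timestamped frames, parse a predicted start/end time from its reply, and score that single interval with the same machinery as above. The prompt wording, the `mllm_generate` interface, and the regex parsing are assumptions for illustration, not the exact protocol used for the results in the table.

```python
# Sketch of a generation-based approach: prompt an MLLM with timestamped
# frames and parse a predicted [start, end] interval from its free-form reply.
# `mllm_generate` is a placeholder for any chat-style multimodal model call.
import re

PROMPT = (
    "You are given frames sampled from a long video, each labeled with its "
    "timestamp in seconds. Question: {query}\n"
    "Answer with the time span containing the relevant moment, "
    "formatted exactly as START-END (in seconds)."
)

def predict_moment(timestamped_frames, query, mllm_generate):
    reply = mllm_generate(images=timestamped_frames,
                          text=PROMPT.format(query=query))
    match = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None  # an unparseable answer is treated as a miss
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if end > start else None
```

Since such a model emits a single interval rather than a ranked list, R@1 and mAP@5 largely coincide, which is consistent with the near-identical values for the generation-based methods in the table above.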
If you have any questions about this benchmark, feel free to contact hyyuan@ruc.edu.cn.
If you find our work useful, please consider citing our paper:
@misc{yuan2025momentseekercomprehensivebenchmarkstrong,
title={MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos},
author={Huaying Yuan and Jian Ni and Yueze Wang and Junjie Zhou and Zhengyang Liang and Zheng Liu and Zhao Cao and Zhicheng Dou and Ji-Rong Wen},
year={2025},
eprint={2502.12558},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.12558},
}