Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

*Equal contribution, listed in alphabetical order

Corresponding author: Qin Jin (qjin@ruc.edu.cn)

1AIM3 Lab, Renmin University of China; 2MiLM Plus, Xiaomi Inc.; 3Beijing University of Posts and Telecommunications
arXiv, 2025

Abstract

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend more difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small but comprehensive and balanced benchmark suitable for LVLM evaluation, sourced from publicly available benchmarks. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using significantly less training data than prior LVLM approaches, while improving its general video understanding capabilities.

Overall Contributions and Illustration

Method Overview

Figure: Overview of Time-R1 and its improved cases on temporal video grounding, short video QA, and long video QA.

The TVG task aims to temporally localize video segments within long-form videos based on natural language queries. Given a video of duration \( t \) seconds, which is represented as a sequence of \( T \) frames \( \{x_1,\dots,x_T\} \), and a language query \( q \), the goal is to identify the temporal boundaries \( [t_s, t_e] \) of the segment that best corresponds to \( q \), where \( t_s, t_e \in \mathbb{R}^+ \). In this work, we introduce Time-R1, a framework designed to unleash the potential of LVLMs for the TVG task using reinforcement learning (RL).
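To make the verifiable-reward idea concrete, the snippet below sketches a temporal IoU (tIoU) based reward between a predicted segment and the annotated segment. The function names and the choice of raw tIoU as the reward signal are illustrative assumptions, not the exact reward formulation used in Time-R1.

```python
# Minimal sketch of a tIoU-style verifiable reward for TVG.
# Function names and the use of raw tIoU as the reward are assumptions for illustration.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between a predicted segment [t_s, t_e] and the ground-truth segment."""
    p_s, p_e = pred
    g_s, g_e = gt
    inter = max(0.0, min(p_e, g_e) - max(p_s, g_s))
    union = max(p_e, g_e) - min(p_s, g_s)
    return inter / union if union > 0 else 0.0

def grounding_reward(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Verifiable reward: score a prediction purely by its overlap with the annotation,
    so no learned reward model is required."""
    return temporal_iou(pred, gt)

# Example: predicting [12s, 30s] against a ground truth of [15s, 28s] scores ~0.72.
print(grounding_reward((12.0, 30.0), (15.0, 28.0)))
```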

Experiments

Comparison of SoTA on TVG benchmarks

We evaluate the temporal video grounding (TVG) performance of Time-R1 across three benchmarks: Charades-STA, ActivityNet Captions, and our proposed TVGBench.

Entries in gray\(^*\) denote models fine-tuned on each benchmark, while entries in black report zero-shot performance. Our comparisons cover existing open-source 7B LVLMs as well as state-of-the-art video-language pretraining (VLP) models.

Performance comparison on TVG benchmarks
Table 1: Temporal video grounding performance on Charades-STA, ActivityNet Captions, and TVGBench. Models marked in gray\(^*\) are fine-tuned on the corresponding benchmarks.

As shown in Table 1, Time-R1 achieves the highest accuracy among LVLM-based methods in both fine-tuned and zero-shot scenarios. Notably, on TVGBench, our method surpasses all compared baselines by a large margin, validating its effectiveness in temporal grounding and general video understanding.

Comparison of Post-Training Paradigms

We compare different post-training paradigms for large vision-language models (LVLMs) across multiple tasks, including short video QA, long video QA, and temporal video grounding (TVG).

The methods labeled SFT and RL perform full fine-tuning of the language model, whereas SFT-LoRA fine-tunes with LoRA adapters (see the sketch below). The baseline model used for comparison is Qwen2.5-VL-7B.
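For reference, the snippet below sketches how the SFT-LoRA variant could be configured with the Hugging Face peft library; the rank, target modules, and other hyperparameters are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Illustrative LoRA setup for the SFT-LoRA baseline (hyperparameters are assumed,
# not the exact configuration used in the paper).
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                       # low-rank adapter dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are updated,
                                    # unlike full SFT/RL which updates the whole LM
```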

Comparison of Post-Training Paradigms
Table 2: Performance comparison between post-training paradigms (SFT, RL, SFT-LoRA) on short video QA, long video QA, and TVG tasks.

The results highlight that our RL-based approach consistently improves performance across tasks compared to supervised fine-tuning and LoRA fine-tuning, demonstrating the effectiveness of reinforcement learning with verifiable rewards in aligning LVLMs to temporal video understanding.

Ablation Study

Both Gaussian Filtering (GF) and Multi-Epoch training (ME) individually improve performance, with ME yielding the more substantial gain, raising R1@0.7 from 13.2 (Row 1) to 14.2 (Row 4). Notably, combining tIoU supervision with ME (Row 6) leads to a significant boost across all metrics. Adding further components, GF together with ME (Row 7) and then Sample Filtering (SF) (Row 8), continues to improve performance, ultimately reaching 29.4 R1@0.5 and 16.4 R1@0.7.

Ablation Study of Time-R1 Training Components
Table 3: Ablation of Time-R1-7B training. GF, ME, and SF refer to Gaussian Filtering, Multi-Epoch training, and per-epoch Sample Filtering, respectively.
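The exact formulations of GF and SF are not reproduced in this summary; the sketch below gives one assumed interpretation, in which Gaussian Filtering softens the boundary-level reward around the ground-truth timestamps and per-epoch Sample Filtering keeps only samples that remain informative for the next epoch.

```python
import math

# Assumed illustration of the ablated components (not the paper's exact formulas):
# GF as a Gaussian-smoothed boundary score blended with tIoU, and SF as a
# per-epoch filter that keeps samples that are neither solved nor hopeless.

def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def gaussian_boundary_score(pred_t, gt_t, sigma=2.0):
    """Give partial credit to boundaries that are close to, but not exactly at,
    the ground-truth timestamp (sigma in seconds, assumed)."""
    return math.exp(-((pred_t - gt_t) ** 2) / (2 * sigma ** 2))

def smoothed_reward(pred, gt, alpha=0.5):
    """Blend segment overlap (tIoU) with Gaussian-smoothed boundary accuracy."""
    boundary = 0.5 * (gaussian_boundary_score(pred[0], gt[0]) +
                      gaussian_boundary_score(pred[1], gt[1]))
    return alpha * temporal_iou(pred, gt) + (1 - alpha) * boundary

def keep_for_next_epoch(reward, low=0.1, high=0.9):
    """Per-epoch sample filtering: drop samples that are already solved or far
    too hard, so later epochs focus on progressively harder, learnable cases."""
    return low < reward < high
```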