*Equal contribution, listed in alphabetical order
†Corresponding author: Qin Jin (qjin@ruc.edu.cn)
Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable rewards to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend more difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive and balanced benchmark for LVLM evaluation, sourced from publicly available benchmarks. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using significantly less training data than prior LVLM approaches, while also improving its general video understanding capabilities.
Figure: Overview of Time-R1 and its improved cases on temporal video grounding, short video QA, and long video QA.
The TVG task aims to temporally localize video segments within long-form videos based on natural language queries. Given a video of duration \( t \) seconds, which is represented as a sequence of \( T \) frames \( \{x_1,\dots,x_T\} \), and a language query \( q \), the goal is to identify the temporal boundaries \( [t_s, t_e] \) of the segment that best corresponds to \( q \), where \( t_s, t_e \in \mathbb{R}^+ \). In this work, we introduce Time-R1, a framework designed to unleash the potential of LVLMs for the TVG task using reinforcement learning (RL).
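The quality of a predicted segment is commonly measured by its temporal Intersection-over-Union (tIoU) with the ground-truth boundaries, which also underlies the metrics reported later. The sketch below is a minimal illustration of this computation; the function name `temporal_iou` and the clamping details are our own and are not taken from the released code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between a predicted segment and a ground-truth segment.

    pred, gt: (start, end) tuples in seconds, with start <= end.
    Returns a value in [0, 1].
    """
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Example: a prediction of [12.0, 30.0] s against a ground truth of [10.0, 25.0] s.
print(temporal_iou((12.0, 30.0), (10.0, 25.0)))  # 13 / 20 = 0.65
```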
We evaluate the temporal video grounding (TVG) performance of Time-R1 across three benchmarks: Charades-STA, ActivityNet Captions, and our proposed TVGBench.
Entries in gray\( ^*\) denote models fine-tuned on the corresponding benchmark, while entries in black indicate zero-shot performance. Our comparisons cover existing open-source 7B LVLMs as well as state-of-the-art video-language pretraining (VLP) models.
As shown in Table 1, Time-R1 achieves the highest accuracy among LVLM-based methods in both fine-tuned and zero-shot scenarios. Notably, on TVGBench, our method surpasses all compared baselines by a large margin, validating its effectiveness in temporal grounding and general video understanding.
We compare different post-training paradigms for large vision-language models (LVLMs) across multiple tasks, including short video QA, long video QA, and temporal video grounding (TVG).
The methods labeled SFT and RL denote full fine-tuning of the language model, whereas SFT-LoRA denotes fine-tuning with LoRA. The baseline model used for comparison is Qwen2.5-VL-7B.
The results highlight that our RL-based approach consistently improves performance across tasks compared to supervised finetuning and LoRA finetuning, demonstrating the effectiveness of reinforcement learning with verifiable rewards in aligning LVLMs to temporal video understanding.
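As a concrete, hypothetical illustration of a verifiable reward for TVG, the sketch below scores a sampled model response by checking that it follows an expected output format and by measuring the tIoU of the parsed segment against the ground truth. The `<answer>...</answer>` tag template, the function name `tvg_reward`, and the reward weights are assumptions for illustration; the exact format and weighting used by Time-R1 may differ.

```python
import re


def tvg_reward(response: str, gt_span: tuple) -> float:
    """Verifiable reward for a temporal grounding response (illustrative only).

    Expects the model to emit a segment such as "<answer>12.0 to 25.5</answer>";
    this tag format is an assumption, not the exact template used by Time-R1.
    """
    match = re.search(r"<answer>\s*([\d.]+)\s*to\s*([\d.]+)\s*</answer>", response)
    if match is None:
        return 0.0  # unparseable output earns no reward
    t_s, t_e = float(match.group(1)), float(match.group(2))
    if t_e < t_s:
        return 0.0  # reject invalid spans

    # Temporal IoU with the ground-truth segment serves as the verifiable signal.
    inter = max(0.0, min(t_e, gt_span[1]) - max(t_s, gt_span[0]))
    union = (t_e - t_s) + (gt_span[1] - gt_span[0]) - inter
    iou = inter / union if union > 0 else 0.0

    format_bonus = 0.1  # small reward for well-formed output (assumed weight)
    return format_bonus + 0.9 * iou
```

In a typical RL-with-verifiable-rewards setup, rewards of this kind are computed for a group of sampled responses per query and converted into advantages, without any learned reward model.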
Both Gaussian Filtering (GF) and Multi-Epoch training (ME) individually improve performance, with ME yielding the more substantial gain, raising R1@0.7 from 13.2 (Row 1) to 14.2 (Row 4). Notably, combining tIoU supervision with ME (Row 6) leads to a significant boost across all metrics. Adding further components, GF together with ME (Row 7) and then Sample Filtering (SF) (Row 8), continues to improve performance, ultimately reaching 29.4 R1@0.5 and 16.4 R1@0.7.
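For clarity, R1@m denotes Recall@1 at tIoU threshold m: the percentage of queries whose top-1 predicted segment overlaps the ground truth with tIoU at least m. A minimal sketch of this evaluation follows; the function name `recall_at_1` is ours, not from the released code.

```python
def recall_at_1(predictions, ground_truths, threshold):
    """Percentage of queries whose top-1 prediction reaches the given tIoU threshold.

    predictions, ground_truths: lists of (start, end) tuples, aligned by index.
    """
    hits = 0
    for (ps, pe), (gs, ge) in zip(predictions, ground_truths):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = (pe - ps) + (ge - gs) - inter
        iou = inter / union if union > 0 else 0.0
        if iou >= threshold:
            hits += 1
    return 100.0 * hits / len(predictions)


# R1@0.5 and R1@0.7 are this quantity at thresholds 0.5 and 0.7, respectively.
```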