Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

*Equal contribution, listed in alphabetical order

Corresponding author: Qin Jin (qjin@ruc.edu.cn)

1AIM3 Lab, Renmin University of China; 2MiLM Plus, Xiaomi Inc.; 3Beijing University of Posts and Telecommunications
arXiv, 2025

Abstract

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend more difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small but comprehensive and balanced benchmark suitable for LVLM evaluation, sourced from publicly available benchmarks. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using significantly less training data than prior LVLM approaches, while improving its general video understanding capabilities.

Overall Contributions and Illustration

Method Overview

Figure: Overview of Time-R1 and its improved cases on temporal video grounding, short video QA, and long video QA.

The TVG task aims to temporally localize video segments within long-form videos based on natural language queries. Given a video of duration \( t \) seconds, which is represented as a sequence of \( T \) frames \( \{x_1,\dots,x_T\} \), and a language query \( q \), the goal is to identify the temporal boundaries \( [t_s, t_e] \) of the segment that best corresponds to \( q \), where \( t_s, t_e \in \mathbb{R}^+ \). In this work, we introduce Time-R1, a framework designed to unleash the potential of LVLMs for the TVG task using reinforcement learning (RL).
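To make the verifiable-reward idea concrete, the snippet below sketches a temporal IoU (tIoU) based reward between a predicted segment and the annotated segment. The function names and the choice of raw tIoU as the reward signal are illustrative assumptions, not the exact reward formulation used in Time-R1.

```python
# Minimal sketch of a tIoU-style verifiable reward for TVG.
# Function names and the use of raw tIoU as the reward are assumptions for illustration.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between a predicted segment [t_s, t_e] and the ground-truth segment."""
    p_s, p_e = pred
    g_s, g_e = gt
    inter = max(0.0, min(p_e, g_e) - max(p_s, g_s))
    union = max(p_e, g_e) - min(p_s, g_s)
    return inter / union if union > 0 else 0.0

def grounding_reward(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Verifiable reward: score a prediction purely by its overlap with the annotation,
    so no learned reward model is required."""
    return temporal_iou(pred, gt)

# Example: predicting [12s, 30s] against a ground truth of [15s, 28s] scores ~0.72.
print(grounding_reward((12.0, 30.0), (15.0, 28.0)))
```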

Experiments

Comparison of SoTA on TVG benchmarks

We evaluate the temporal video grounding (TVG) performance of Time-R1 across three benchmarks: Charades-STA, ActivityNet Captions, and our proposed TVGBench.

Entries in gray\(^*\) denote models fine-tuned on each benchmark, while entries in black report zero-shot performance. Our comparisons cover existing open-source 7B LVLMs as well as state-of-the-art video-language pretraining (VLP) models.

Performance comparison on TVG benchmarks
Table 1: Temporal video grounding performance on Charades-STA, ActivityNet Captions, and TVGBench. Models marked in gray\(^*\) are fine-tuned on the corresponding benchmarks.

As shown in Table 1, Time-R1 achieves the highest accuracy among LVLM-based methods in both fine-tuned and zero-shot scenarios. Notably, on TVGBench, our method surpasses all compared baselines by a large margin, validating its effectiveness in temporal grounding and general video understanding.

Comparison of Post-Training Paradigms

We compare different post-training paradigms for large vision-language models (LVLMs) across multiple tasks, including short video QA, long video QA, and temporal video grounding (TVG).

The methods labeled SFT and RL perform full fine-tuning of the language model, whereas SFT-LoRA fine-tunes with LoRA adapters (see the sketch below). The baseline model used for comparison is Qwen2.5-VL-7B.
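For reference, the snippet below sketches how the SFT-LoRA variant could be configured with the Hugging Face peft library; the rank, target modules, and other hyperparameters are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Illustrative LoRA setup for the SFT-LoRA baseline (hyperparameters are assumed,
# not the exact configuration used in the paper).
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                       # low-rank adapter dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are updated,
                                    # unlike full SFT/RL which updates the whole LM
```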

Comparison of Post-Training Paradigms
Table 2: Performance comparison between post-training paradigms (SFT, RL, SFT-LoRA) on short video QA, long video QA, and TVG tasks.

The results highlight that our RL-based approach consistently improves performance across tasks compared to supervised fine-tuning and LoRA fine-tuning, demonstrating the effectiveness of reinforcement learning with verifiable rewards in aligning LVLMs to temporal video understanding.

Ablation Study

Both Gaussian Filtering (GF) and Multi-Epoch training (ME) individually improve performance, with ME yielding the more substantial gain, raising R1@0.7 from 13.2 (Row 1) to 14.2 (Row 4). Notably, combining tIoU supervision with ME (Row 6) leads to a significant boost across all metrics. Adding further components, GF together with ME (Row 7) and then Sample Filtering (SF) (Row 8), continues to improve performance, ultimately reaching 29.4 R1@0.5 and 16.4 R1@0.7.

Ablation Study of Time-R1 Training Components
Table 3: Ablation of Time-R1-7B training. GF, ME, and SF refer to Gaussian Filtering, Multi-Epoch training, and per-epoch Sample Filtering, respectively.
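The exact formulations of GF and SF are not reproduced in this summary; the sketch below gives one assumed interpretation, in which Gaussian Filtering softens the boundary-level reward around the ground-truth timestamps and per-epoch Sample Filtering keeps only samples that remain informative for the next epoch.

```python
import math

# Assumed illustration of the ablated components (not the paper's exact formulas):
# GF as a Gaussian-smoothed boundary score blended with tIoU, and SF as a
# per-epoch filter that keeps samples that are neither solved nor hopeless.

def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def gaussian_boundary_score(pred_t, gt_t, sigma=2.0):
    """Give partial credit to boundaries that are close to, but not exactly at,
    the ground-truth timestamp (sigma in seconds, assumed)."""
    return math.exp(-((pred_t - gt_t) ** 2) / (2 * sigma ** 2))

def smoothed_reward(pred, gt, alpha=0.5):
    """Blend segment overlap (tIoU) with Gaussian-smoothed boundary accuracy."""
    boundary = 0.5 * (gaussian_boundary_score(pred[0], gt[0]) +
                      gaussian_boundary_score(pred[1], gt[1]))
    return alpha * temporal_iou(pred, gt) + (1 - alpha) * boundary

def keep_for_next_epoch(reward, low=0.1, high=0.9):
    """Per-epoch sample filtering: drop samples that are already solved or far
    too hard, so later epochs focus on progressively harder, learnable cases."""
    return low < reward < high
```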