TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

*Equal contribution; Project Lead

Corresponding author: Qin Jin (qjin@ruc.edu.cn)

1AIM3 Lab, Renmin University of China, 2MiLM Plus, Xiaomi Inc.
arXiv, 2025

Abstract

We introduce TimeViper, a hybrid vision-language model designed to tackle the challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms.

Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from visual tokens to textual tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities.

This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper remains competitive with state-of-the-art models while supporting substantially longer inputs. We further analyze the attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability.

This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

Overall Contributions and Illustration


Figure: Overview of TimeViper and example cases illustrating its improvements on temporal video grounding, short video QA, and long video QA.

We present TimeViper, a hybrid Mamba-Transformer vision-language model for efficient long video understanding. We reveal the severe vision token redundancy and a vision-to-text information aggregation phenomenon in hybrid models.

To this end, we introduce TransV, the first token-transfer module that compresses vision tokens into text tokens inside the LLM. Benefiting from the Mamba layers' O(n) computation and O(1) cache cost, TimeViper generates 40.1% more tokens per second than Qwen3 when processing 32k input tokens (approximately 2k frames at 16 tokens per frame) and producing 1k output tokens with batch size 32.
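To make the cache argument concrete, the sketch below (our illustration, not TimeViper's released code; all layer dimensions are assumed values) contrasts the decode-time cache of a standard attention layer, which must store keys and values for every past token, with the fixed-size recurrent state of a Mamba layer.

```python
# Illustrative sketch (not TimeViper code): decode-time cache size of one
# attention layer vs. one Mamba layer. All dimensions below are assumptions.
NUM_HEADS, HEAD_DIM = 16, 128                   # attention layer (assumed)
D_MODEL, D_STATE, CONV_WINDOW = 2048, 128, 4    # Mamba layer (assumed)

def attention_cache_elems(seq_len: int) -> int:
    # Attention keeps K and V for every previously seen token: O(n) cache.
    return 2 * seq_len * NUM_HEADS * HEAD_DIM

def mamba_cache_elems() -> int:
    # A Mamba/SSM layer keeps only a fixed-size recurrent state plus a short
    # convolution buffer, independent of sequence length: O(1) cache.
    return D_MODEL * D_STATE + D_MODEL * CONV_WINDOW

# At 16 tokens per frame: ~2k frames ~ 32k tokens, ~10k frames ~ 160k tokens.
for seq_len in (32_000, 160_000):
    print(f"{seq_len:>7} tokens | attention cache: {attention_cache_elems(seq_len):>13,} elems"
          f" | mamba cache: {mamba_cache_elems():,} elems")
```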

Across public benchmarks, TimeViper delivers performance competitive with current Transformer-based MLLMs on tasks including multi-choice QA on VideoMME, temporal video grounding on Charades, detailed video captioning on VDC, and hour-long video understanding on LVBench.

Method Overview


Illustration of TimeViper, our proposed hybrid MLLM for long video understanding. The model consists of a ViT visual encoder, a projector with token merging, and a hybrid Mamba-Transformer LLM equipped with TransV. Token merging compresses each frame into 16 vision tokens.

Inside the LLM, TransV transfers information from redundant vision tokens to instruction tokens to reduce the number of vision tokens. Specifically, TransV uniformly drops vision tokens in shallow layers and removes low-attention vision tokens in deeper layers. The compression module is implemented through a Gated Cross-Attention mechanism with adaptive learnable weights. Note that TransV is illustrated before the attention layer for clarity, though it may be applied before any layer in practice.
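The snippet below is a minimal PyTorch sketch of this transfer-then-prune step based only on the description above; module and argument names (GatedCrossAttention, transv_step, keep_ratio) and all sizes are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Instruction tokens (queries) absorb information from vision tokens
    (keys/values); a learnable gate controls how much is written back."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(dim))   # tanh(0) = 0: gate starts closed

    def forward(self, instr: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        transferred, _ = self.attn(instr, vision, vision)
        return instr + torch.tanh(self.gate) * transferred

def transv_step(vision, instr, xattn, shallow: bool, keep_ratio: float = 0.5,
                importance=None):
    """Transfer vision information into instruction tokens, then prune vision tokens.
    vision: [B, Nv, D], instr: [B, Nt, D], importance: [B, Nv] attention received."""
    instr = xattn(instr, vision)                     # write vision info into instruction tokens
    n_keep = max(1, int(vision.size(1) * keep_ratio))
    if shallow or importance is None:
        # shallow layers: uniform subsampling of vision tokens
        keep = torch.linspace(0, vision.size(1) - 1, n_keep, device=vision.device).long()
        vision = vision[:, keep]
    else:
        # deep layers: keep only the vision tokens that receive the most attention
        keep = importance.topk(n_keep, dim=1).indices.sort(dim=1).values
        vision = torch.gather(vision, 1, keep.unsqueeze(-1).expand(-1, -1, vision.size(-1)))
    return vision, instr

# Toy usage: 1024 vision tokens -> 512 after one TransV-style step.
B, Nv, Nt, D = 2, 1024, 32, 256
xattn = GatedCrossAttention(D)
v, t = transv_step(torch.randn(B, Nv, D), torch.randn(B, Nt, D), xattn, shallow=True)
print(v.shape, t.shape)   # torch.Size([2, 512, 256]) torch.Size([2, 32, 256])
```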

Main Quantitative Results


TimeViper achieves competitive performance with state-of-the-art models across video understanding benchmarks.

On MCQ tasks, despite not fine-tuning the ViT, TimeViper with TransV achieves an average accuracy of 55.9 on VideoMME, +0.4 points over Video-XL (55.5), which compresses vision tokens into new ones within Qwen2.

On the VDC task, TimeViper achieves strong performance with an accuracy of 39.7, exceeding the task-specific model AuroraCap by +39.1 points.

On the TVG task, TimeViper establishes a surprisingly strong baseline of 42.6 mIoU on Charades, significantly outperforming the task-specific model VTimeLLM-13B, which achieves 34.6 mIoU. This is particularly notable because TimeViper uses only the SigLIP positional embedding for vision tokens and relies on the implicit temporal modeling of the Mamba layers. Yet the model learns robust temporal alignments between video content and language queries, matching or exceeding prior models such as Qwen2.5-VL-7B that explicitly employ MRoPE for fine-grained timestamp modeling.
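For reference, TVG is scored with the temporal IoU between predicted and ground-truth segments, averaged over queries (mIoU); a minimal computation of this standard metric is sketched below (not model code), which also makes the IoU of 0.75 in the qualitative example later on concrete.

```python
# Temporal IoU as used for TVG evaluation (standard metric, not TimeViper code).
def temporal_iou(pred: tuple, gt: tuple) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts) -> float:
    # mIoU: average temporal IoU over all query-segment pairs.
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Example: a prediction of 8.0-20.0 s against ground truth 10.0-19.0 s gives IoU 0.75.
print(round(temporal_iou((8.0, 20.0), (10.0, 19.0)), 2))   # 0.75
```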

These results collectively demonstrate that hybrid Mamba-Transformer architectures are highly competitive for long video understanding.

Attention Score Matrix Analysis


Illustration of attention score matrices in Nanov2 and Qwen2.5 at shallow and deep layers. White lines divide the input sequence into four distinct segments: system prompt, vision tokens, user instruction, and the generated response.
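Such a figure is built from per-layer attention weights split by segment; below is a generic sketch of that kind of segment-wise statistic (segment boundaries and tensor shapes are placeholders, not the paper's analysis code).

```python
# Generic sketch of a segment-wise attention statistic of the kind visualized
# above; boundaries and shapes are placeholders, not the paper's analysis code.
import torch

def segment_attention_mass(attn, bounds, query_seg="instruction", key_seg="vision"):
    """attn: [num_heads, seq_len, seq_len] attention weights from one layer.
    Returns the average attention mass that query_seg tokens place on key_seg tokens."""
    qs, qe = bounds[query_seg]
    ks, ke = bounds[key_seg]
    return attn[:, qs:qe, ks:ke].sum(dim=-1).mean().item()

# Example layout: system prompt | vision tokens | user instruction | response.
bounds = {"system": (0, 8), "vision": (8, 328), "instruction": (328, 360), "response": (360, 420)}
attn = torch.rand(8, 420, 420)
attn = attn / attn.sum(dim=-1, keepdim=True)   # row-normalize, as softmax outputs would be
print(segment_attention_mass(attn, bounds))    # share of instruction-to-vision attention
```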

Qualitative Case Studies


Qualitative results of TimeViper on three long video understanding tasks. (1) MCQ: The model demonstrates reasoning capability by correctly answering a multi-choice question about the video's content. (2) TVG: It accurately localizes the temporal boundaries for a specific event, achieving an IoU of 0.75. (3) VDC: The model generates a detailed description that showcases its fine-grained comprehension. Green text highlights accurate detailed descriptions. Some output in the middle is omitted for brevity.

BibTeX

@article{xu2025timeviper,
  title={TimeViper: A Hybrid Mamba-Transformer Model for Efficient Long Video Understanding},
  author={Xu, Boshen and Xiao, Zihan and Li, Jiaze and Ju, Jianzhong and Luo, Zhenbo and Luan, Jian and Jin, Qin},
  journal={arXiv preprint arXiv:2511.16595},
  year={2025}
}