Main Quantitative Results
TimeViper achieves performance competitive with state-of-the-art models across video understanding benchmarks.
For MCQ tasks, despite not fine-tuning the ViT, TimeViper with TransV achieves an average accuracy of 55.9 on VideoMME, +0.4 points higher than Video-XL (55.5), which compresses visual tokens into new tokens within its Qwen2 backbone.
For the VDC task, TimeViper achieves strong performance with an accuracy of 39.7, exceeding the task-specific model AuroraCap by +39.1 points.
For the TVG task, TimeViper establishes a surprisingly strong baseline of 42.6 mIoU on Charades, significantly outperforming the task-specific model VTimeLLM-13B, which achieves 34.6 mIoU. This is particularly notable because TimeViper uses only the SigLIP positional embeddings for vision tokens and relies on the implicit temporal modeling of the Mamba layers. Yet the model learns robust temporal alignments between videos and language queries, matching or exceeding prior models such as Qwen2.5-VL-7B that explicitly employ MRoPE for fine-grained timestamp modeling.
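For reference, the mIoU reported here is the mean temporal intersection-over-union between predicted and ground-truth moments. Below is a minimal Python sketch of that standard metric, not TimeViper's actual evaluation code; the function names and segment values are hypothetical and purely illustrative.

    # Minimal sketch of temporal mIoU for temporal video grounding (TVG).
    # Segments are (start_sec, end_sec); the example values are hypothetical.

    def temporal_iou(pred, gt):
        """Intersection-over-union of two temporal segments."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    def mean_iou(preds, gts):
        """mIoU: average temporal IoU over all (prediction, ground-truth) pairs."""
        return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

    if __name__ == "__main__":
        preds = [(2.0, 8.0), (10.0, 15.0), (0.0, 5.0)]  # predicted moments (seconds)
        gts = [(3.0, 9.0), (10.0, 14.0), (6.0, 9.0)]    # ground-truth moments (seconds)
        print(f"mIoU = {mean_iou(preds, gts):.3f}")     # ~0.505 for these toy values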
These results collectively demonstrate that hybrid Mamba-Transformer architectures are highly competitive for long video understanding.