Boshen Xu

About

My research interests include video understanding (e.g., action recognition, vision-language pretraining, and long-form video grounding) and embodied AI (e.g., 3D object assembly and 3D HOI reconstruction). I am currently focusing on egocentric vision and related topics that benefit VR/AR and embodied AI.

I am a second-year PhD student at Renmin University of China (RUC), advised by Professor Qin Jin in the AIM3 Lab. Before joining RUC, I received my bachelor's degree from the School of Computer Science and Engineering at the University of Electronic Science and Technology of China (UESTC).

Email  /  CV  /  Github  /  Google Scholar


Publications

* denotes equal contribution.
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang*, Ziheng Wang*, Boshen Xu*, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin
arXiv, 2025
project page / code (to release soon) / arXiv

We introduce the Time-R1 framework, TimeRFT training, and TVGBench, a benchmark for evaluating LVLMs, to advance the use of LVLMs for temporal video grounding. Time-R1 achieves state-of-the-art performance on temporal video grounding using only 2.5K samples for RL fine-tuning, while also improving performance on four VQA benchmarks.

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, Qin Jin
arXiv, 2025
code / arXiv

We introduce EgoDTM, an Egocentric Depth- and Text-aware Model that bridges the gap between 2D visual understanding and 3D spatial awareness.

TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM
Ye Wang*, Boshen Xu*, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, Qin Jin
arXiv, 2025
code / paper

We propose TimeZero, a reasoning-guided LVLM for temporal video grounding that extends the inference process through reinforcement learning to reason about video-language relationships. It achieves state-of-the-art performance on the Charades-STA benchmark.

Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
Boshen Xu, Ziheng Wang*, Yang Du*, Zhinan Song, Sipeng Zheng, Qin Jin
ICLR, 2025
code / paper

We propose EgoNCE++, an asymmetric contrastive pretraining objective that addresses the weakness of egocentric video-language models (EgoVLMs) in distinguishing HOI combinations under word variations.

SPAFormer: Sequential 3D Part Assembly with Transformers
Boshen Xu, Sipeng Zheng, Qin Jin
3DV, 2025
project page / code / arXiv

We present SPAFormer, a transformer-based framework that leverages assembly sequence constraints together with three part encodings to address the combinatorial explosion challenge in the 3D part assembly task.

Unveiling Visual Biases in Audio-Visual Localization Benchmarks
Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin
ECCV AVGenL Workshop, 2024
arXiv

We reveal that current audio-visual source localization benchmarks (VGG-SS, Epic-Sounding-Object) can be easily hacked by vision-only models, and therefore call for benchmarks that genuinely require audio cues.

Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World
Boshen Xu, Sipeng Zheng, Qin Jin
ACM MM, 2023
project page / code / arXiv

We propose POV, a view adaptation framework that enables transfer learning from multi-view third-person videos to egocentric videos.

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Sipeng Zheng, Boshen Xu, Qin Jin
CVPR, 2023

We introduce OpenCat, a language modeling framework that reformulates HOI prediction as sequence generation.

Awards

  • 2023, Outstanding Graduate, Sichuan, China
  • 2021, Tencent Special Scholarship, UESTC & Tencent
  • 2021, Second Prize of China Undergraduate Mathematical Contest in Modeling, China
  • 2020, National Scholarship, UESTC, China

Services

  • Conference Reviewer for CVPR, NeurIPS, ICLR, ACM MM, ACL, ACCV.
  • Journal Reviewer for TOMM.
  • Teaching Assistant for Multimedia Application Technology (RUC 2024 Fall).

Feel free to steal this website's template. Inspired by Jon's website.