Boshen Xu

About

My research interests include video understanding (e.g., action recognition and vision-language pretraining) and embodied AI (e.g., 3D object assembly and 3D HOI reconstruction). I am currently focusing on egocentric vision and related topics that benefit VR/AR and embodied AI.

I am a second-year PhD student at Renmin University of China (RUC), advised by Professor Qin Jin at the AIM3 Lab. Before joining RUC, I received my bachelor's degree from the School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC).

Email  /  CV  /  Github  /  Google Scholar

profile photo

Publications

* denotes equal contributions.
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, Qin Jin
arXiv, 2025
code / arXiv

We introduce EgoDTM, an Egocentric Depth- and Text-aware Model that bridges the gap between 2D visual understanding and 3D spatial awareness.

TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM
Ye Wang*, Boshen Xu*, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, Qin Jin
arXiv, 2025
code / paper

We propose TimeZero, a reasoning-guided LVLM for temporal video grounding that extends inference via reinforcement learning to reason about video-language relationships, achieving state-of-the-art performance on the Charades-STA benchmark.

Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
Boshen Xu, Ziheng Wang*, Yang Du*, Zhinan Song, Sipeng Zheng, Qin Jin
ICLR, 2025
code / paper

We propose EgoNCE++, an asymmetric contrastive pretraining objective that addresses EgoVLMs' weakness in distinguishing HOI combinations under word variations.

SPAFormer: Sequential 3D Part Assembly with Transformers
Boshen Xu, Sipeng Zheng, Qin Jin
3DV, 2025
project page / code / arXiv

We present SPAFormer, a transformer-based framework that leverages assembly sequence constraints together with three part encodings to address the combinatorial explosion challenge in the 3D part assembly task.

Unveiling Visual Biases in Audio-Visual Localization Benchmarks
Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin
ECCV AVGenL Workshop, 2024
arXiv

We reveal that current audio-visual source localization benchmarks (VGG-SS, Epic-Sounding-Object) are easily hacked by vision-only models, and we therefore call for benchmarks that genuinely require audio cues.

Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World
Boshen Xu, Sipeng Zheng, Qin Jin
ACM MM, 2023
project page / code / arxiv

We propose POV, a view adaptation framework that enables transfer learning from multi-view third-person videos to egocentric videos.

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Sipeng Zheng, Boshen Xu, Qin Jin
CVPR, 2023

We introduce OpenCat, a language modeling framework that reformulates HOI prediction as sequence generation.

Awards

  • 2023, Outstanding Graduate, Sichuan, China
  • 2021, Tencent Special Scholarship, UESTC & Tencent
  • 2021, Second Prize of China Undergraduate Mathematical Contest in Modeling, China
  • 2020, National Scholarship, UESTC, China

Services

  • Conference Reviewer for ICLR, ACM MM, ACCV.
  • Teaching Assistant for Multimedia Application Technology (RUC 2024 Fall).

Feel free to steal this website's template. Inspired by Jon's website.