Ph.D. Student in Computer Science
Yale University
I'm a third-year Ph.D. student in Computer Science (2023 - [Expected] 2027) at Yale University. Before that, I obtained my B.Eng. in Computer Science (2019 - 2023) at ShanghaiTech University, with a minor in Innovation and Entrepreneurship.
I conduct research on Multimodal Learning inspired by human cognition towards Multimodal Agentic AI, especially vision-language models and spatial understanding. My line of work in "Language for 3D Vision" explores how vision-language models can perceive and understand the world like humans do. My expertise lies in vision-language models, large language models, diffusion models, multimodal learning, and 2D/3D computer vision.
I was also fortunate to intern at Shanghai AI Lab and UISEE during my undergraduate studies.
CVPR (2022, 2025 [Outstanding Reviewer], 2026), ICCV (2023, 2025), ECCV (2024), ICML (2025), ICLR (2025, 2026), NeurIPS (2024, 2025), ACM MM (2023, 2025), AISTATS (2024, 2025), ICASSP (2024, 2025), TCSVT (journal)
Multimodal Learning towards Multimodal Agentic AI
Ever since I was deeply touched by Isaac Asimov's Foundation at the age of 13, my dream has been to create an AI that can think like humans (just like the dream of Prof. Jürgen Schmidhuber). When humans perceive the surrounding environment, we see (2D vision), hear (audio), feel (tactile), and interact (3D vision) simultaneously to reason (language) and understand (neural signal) the world. Therefore, I conduct research on Multimodal Learning towards Multimodal Agentic AI. My research vision is to empower agentic AI with multimodal sensing and multimodal representations, enabling it to perceive, reason, understand, and interact with both the digital and physical worlds like humans do.
Specifically, my line of work in "Language for 3D Vision" (DepthCLIP, PointCLIPv2, WorDepth, RSA, Iris) explores how vision-language models can perceive and understand the world like humans do.
Selected publications are highlighted.
(* indicates equal contributions)
arXiv technical report, 2026 | project page, code
arXiv technical report, 2025
arXiv technical report, 2025
arXiv technical report, 2024 | NECV 2025 Oral Presentation (18.75%)
arXiv technical report, 2024
ACM Multimedia 2022, accepted as Brave New Idea (Acceptance Rate <= 12.5%) | code
arXiv technical report, 2021
Final Project of CS181 Artificial Intelligence, 2021 Fall, ShanghaiTech University | code
Final Project of CS282 Machine Learning, 2021 Spring, ShanghaiTech University
Final Project of CS280 Deep Learning, 2020 Fall, ShanghaiTech University
I'm an amateur Unity game developer, previously supervised by Brain Cox; screenshots of my previous work are shown below.
I'm also an amateur pianist, trombone player, guitar player, and Chinese folk singer.
I have been playing Tarot since 2014, and I am dedicated to combining Tarot with modern psychology to serve as a tool for consciousness.
Previously, I volunteered at WWF-China and Greenpeace.