Ph.D. Student in Computer Science
Yale University
I am looking for a full-time position starting May 2027!
I'm a third-year Ph.D. student in Computer Science (2023 - [Expected] 2027) at Yale University, advised by Prof. Yuval Kluger (primary) and Prof. Daniel Rakita. Before that, I obtained my B.Eng. in Computer Science (2019 - 2023) at ShanghaiTech University, with a minor in Innovation and Entrepreneurship. During my Ph.D., I did research internships at Nvidia Research and Meta Reality Labs.
I conduct research on Multimodal Learning inspired by human cognition, working towards Multimodal Agentic AI, with a focus on vision-language models and spatial understanding. My line of work in "Language for 3D Vision" explores how vision-language models can perceive and understand the world like humans do. My expertise lies in vision-language models, large language models, diffusion models, multimodal learning, and 2D/3D computer vision.
I was also fortunate to intern at Shanghai AI Lab and UISEE during my undergraduate studies.
CVPR (2022, 2025 [Outstanding Reviewer], 2026), ICCV (2023, 2025), ECCV (2024, 2026), ICML (2025, 2026), ICLR (2025, 2026), NeurIPS (2024, 2025), BMVC (2026), ACM MM (2023, 2025), AISTATS (2024, 2025), ICASSP (2024, 2025), TCSVT (journal), TMLR (journal)
Multimodal Representation Learning towards Agentic AI
Since the age of 13, when I was deeply touched by Isaac Asimov's Foundation, my dream has been to create an AI that can think like humans (just like the dream of Prof. Jürgen Schmidhuber). When humans perceive the surrounding environment, we see (vision), hear (audio), and feel (tactile) simultaneously to reason (language) and understand (neural signal), then interact (action) with the world. Therefore, I conduct research on Multimodal Representation Learning towards Agentic AI. My key research questions are (1) how to obtain good multimodal representations for agents, and (2) how to use those representations to help agents better perceive, reason about, understand, and interact with the world.
Specifically, my line of work in "Language for 3D Vision" (DepthCLIP, PointCLIPv2, WorDepth, RSA, Iris) explores how vision-language models can perceive and understand the world like humans do.
Selected publications are highlighted.
(* indicates equal contributions)
Under review at Nature | paper
arXiv technical report, 2026 | NE Agents Day 2026 | project page, code, paper
ECCV 2026 | Coming Soon
ECCV 2026 | Coming Soon
Submitted to a top-tier conference | Coming Soon
Submitted to a top-tier conference | Coming Soon
arXiv technical report, 2025
arXiv technical report, 2025
arXiv technical report, 2024
ACM Multimedia 2022, accepted as Brave New Idea (Acceptance Rate <= 12.5%) | code
arXiv technical report, 2021
Final Project of CS181 Artificial Intelligence, 2021 Fall, ShanghaiTech University | code
Final Project of CS282 Machine Learning, 2021 Spring, ShanghaiTech University
Final Project of CS280 Deep Learning, 2020 Fall, ShanghaiTech University
I'm an amateur Unity game developer, previously supervised by Brain Cox. Screenshots of my previous work are shown below.
I'm also an amateur pianist, trombone player, guitar player, and Chinese folk singer.
I have been playing Tarot since 2014 and am dedicated to combining Tarot with modern psychology to serve as a tool for consciousness.
Previously, I volunteered at WWF-China and Greenpeace.