I'm a third-year Ph.D. student in Computer Science (2023 - 2027, expected) at Yale University. Prior to that, I obtained my B.Eng. in Computer Science (2019 - 2023) from ShanghaiTech University, with a minor in Innovation and Entrepreneurship.
I conduct research on Multimodal Learning inspired by human cognition, working towards Multimodal Agentic AI, with a focus on vision-language models and spatial understanding. My line of work in “Language for 3D Vision” explores how vision-language models can perceive and understand the world like humans do. My expertise lies in vision-language models, large language models, diffusion models, multimodal learning, and 2D/3D computer vision.
I am fortunate to work with many excellent collaborators.
Links:
Google Scholar /
LinkedIn /
GitHub
Reviewer Services: CVPR (2022, 2025 [Outstanding Reviewer], 2026), ICCV (2023, 2025), ECCV (2024), ICML (2025), ICLR (2025, 2026), NeurIPS (2024, 2025), ACM MM (2023, 2025), AISTATS (2024, 2025), ICASSP (2024, 2025), TCSVT (journal)
Email: ziyao.zeng (at) yale.edu
Website format from Xingyi Zhou.
Last updated Nov. 2025
Research Overview
Since the age of 13, deeply moved by Foundation by Isaac Asimov, my dream has been to create an AI that can think like humans (much like the dream of Prof. Jürgen Schmidhuber). When humans perceive the surrounding environment, we see (2D vision), hear (audio), feel (tactile), and interact (3D vision) simultaneously to reason (language) and understand (neural signals) the world. Therefore, I conduct research on Multimodal Learning towards Multimodal Agentic AI. My research vision is to empower agentic AI with multimodal sensing and multimodal representations, enabling it to perceive, reason, understand, and interact with both the digital and physical worlds like humans do.
Specifically, my line of work in “Language for 3D Vision” (DepthCLIP, PointCLIPv2, WorDepth, RSA, Iris) explores how vision-language models can perceive and understand the world like humans do.
Publications
Selected publications are highlighted.
(* indicates equal contributions)
2025
2024
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
Ziyao Zeng*, Jingcheng Ni*, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
arXiv technical report, 2024
NECV 2025 Oral Presentation (18.75%)
RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions
Ziyao Zeng, Yangchao Wu, Hyoungseob Park, Daniel Wang, Fengyu Yang, Stefano Soatto, Dong Lao, Byung-Woo Hong, Alex Wong
NeurIPS 2024
code
NeuroBind: Towards Unified Multimodal Representations for Neural Signals
Fengyu Yang*, Chao Feng*, Daniel Wang*, Tianye Wang,
Ziyao Zeng, Zhiyang Xu, Hyoungseob Park, Pengliang Ji, Hanbin Zhao, Yuanning Li, Alex Wong
arXiv technical report, 2024
2023
WorDepth: Variational Language Prior for Monocular Depth Estimation
Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
CVPR 2024
code
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang*, Chao Feng*, Ziyang Chen*, Hyoungseob Park, Daniel Wang, Yiming Dou,
Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong
CVPR 2024
project page,
code
2022
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
Xiangyang Zhu*, Renrui Zhang*, Bowei He, Ziyu Guo,
Ziyao Zeng, Zipeng Qin, Shanghang Zhang, Peng Gao
ICCV 2023
code
Can Language Understand Depth?
Renrui Zhang*,
Ziyao Zeng*, Ziyu Guo, Yafeng Li
ACM Multimedia 2022, accepted as Brave New Idea (Acceptance Rate ≤ 12.5%)
code
2021
Twitter Emotion Classification
Yiteng Xu*,
Ziyao Zeng*, Jirui Shi*, Shaoxun Wu*, Peiyan Gu*
Final Project of CS181 Artificial Intelligence, 2021 Fall, ShanghaiTech University
code
My Adventure
I am a big fan of adventure, enthusiastic about cycling, hiking, and mountain climbing.
"Being a scientist and being an adventurer have a lot in common: both aim to achieve something that hasn't been achieved before."
In 2015, I hiked across the Lake District of England in one week.
In 2019, I cycled across Tibet for 28 days, from Chengdu to Lhasa, covering 2,135 km.
In 2022, I cycled across Tibet and Xinjiang for one month, from Ürümqi to Lhasa, covering 5,000 km, with about 2,000 km of cycling at an average altitude of 4,500 m.
In 2023, I hiked in Yubeng Village for 5 days, at altitudes between 3,000 m and 4,300 m.
My hiking video at Ice Lake, 3,700 m altitude:
Link
My hiking video at God Lake, 4,300 m altitude:
Link
In 2023, I hiked the Tiger Leaping Gorge High Road for 2 days.
In 2023, I cycled more than 350 km around Qinghai Lake over 4 days.
In 2024, I got my diving certificate in the Red Sea.
In 2024, I successfully climbed to the summit of Mount Yuzhu, 6,178 m in altitude.
In summer 2025, I drove from Yale (CT) to Nvidia (CA) across the US for my summer internship in 10 days, then drove back in 14 days. Along the way, I traveled through 22 states and visited 9 national parks.
In Nov. 2025, I received my temporary student pilot certificate issued by the FAA, and I am looking forward to my solo flight!
My other photos regarding adventures.
Other things about myself
I'm an amateur Unity game developer, previously supervised by Brain Cox; screenshots of my previous works are shown below.
Snow Ranger
Darkside
I'm also an amateur pianist, trombone player, guitar player, and Chinese folk singer.
I have been reading Tarot since 2014, and am dedicated to combining Tarot with modern psychology as a tool for exploring consciousness.
Previously, I volunteered at WWF-China and Greenpeace.