I'm a second-year Ph.D. student in Computer Science (2023 - [Expected] 2027) at Yale University, supervised by Prof. Alex Wong. Prior to that, I obtained my B.Eng. in Computer Science (2019 - 2023) at ShanghaiTech University, with a minor in Innovation and Entrepreneurship.
I am an incoming PhD Research Intern at NVIDIA Research, mentored by Orazio Gallo. Previously, I interned with Prof. Jianbo Shi at the UPenn GRASP Lab and with Prof. Xuming He at the ShanghaiTech PLUS Group.
I conduct research on Computer Vision, Machine Learning, and Robotics, focusing on Multimodal Embodied AI inspired by human learning. Currently, my research centers on Vision-Language Models for 3D Vision (Perception and Reconstruction).
Google Scholar /
GitHub /
Yale Vision Lab
Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML
Email: ziyao.zeng (at) yale.edu
Website template from Xingyi Zhou.
Last updated April 2025
Research Overview
Since the age of 13, when I was deeply moved by Isaac Asimov's Foundation, my dream has been to create an AI that can think like humans (much like the dream of Prof. Jürgen Schmidhuber). When humans perceive the surrounding environment, we see (2D vision), touch (tactile), wander (3D vision), and hear (audio) simultaneously to understand (neural signals) and interpret (language). Therefore, I conduct research on Multimodal Embodied AI. My research vision is to empower embodied AI with multimodal sensing and the ability to leverage pre-trained multimodal representations, so that it can interact with the physical world as humans do.
Specifically, I conduct research on Language for 3D Vision (Perception and Reconstruction). Given a language description, one can easily imagine what the scene might look like, so language can serve as a condition for generating and manipulating 3D scenes in a controllable manner. Conversely, a language description can act as a scene-specific prior that enhances 3D reconstruction by resolving scale ambiguity. Language can also express ordinal relationships between objects, from which their relative depth can be inferred. Because language descriptions are invariant to nuisance variability (e.g., illumination, occlusion, and viewpoint in images), they provide robust additional features that help models generalize. Practically, language is also arguably cheaper to obtain than range measurements (e.g., from lidar or radar).
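As a rough, hedged sketch of the scale-ambiguity idea above (a conceptual illustration, not any paper's actual implementation): a text embedding, e.g. from a CLIP text encoder, could be mapped to a single positive scale that converts an up-to-scale depth prediction into metric depth. All module names and tensor shapes below are placeholder assumptions.

```python
# Conceptual sketch only: predict a global scale for a relative depth map
# from a language embedding. The text encoder and depth network are assumed
# to exist upstream (e.g., a CLIP text encoder and any monocular depth model).
import torch
import torch.nn as nn

class ScaleFromText(nn.Module):
    def __init__(self, text_dim: int = 512):
        super().__init__()
        # Small MLP mapping a text embedding to one scalar per image.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, text_embedding: torch.Tensor, relative_depth: torch.Tensor):
        # Softplus keeps the predicted scale strictly positive.
        scale = torch.nn.functional.softplus(self.mlp(text_embedding))
        # Broadcast the per-image scale over the H x W depth map.
        return scale.view(-1, 1, 1) * relative_depth

# Placeholder tensors standing in for CLIP text embeddings and
# up-to-scale depth predictions for a batch of two images.
text_emb = torch.randn(2, 512)
rel_depth = torch.rand(2, 192, 256)
metric_depth = ScaleFromText()(text_emb, rel_depth)
print(metric_depth.shape)  # torch.Size([2, 192, 256])
```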
Publications
Selected publications on "Language for 3D Vision" are highlighted.
(* indicates equal contributions)
2025
2024
PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
arXiv technical report, 2024
RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions
Ziyao Zeng, Yangchao Wu, Hyoungseob Park, Daniel Wang, Fengyu Yang, Stefano Soatto, Dong Lao, Byung-Woo Hong, Alex Wong
NeurIPS 2024
NeuroBind: Towards Unified Multimodal Representations for Neural Signals
Fengyu Yang*, Chao Feng*, Daniel Wang*, Tianye Wang,
Ziyao Zeng, Zhiyang Xu, Hyoungseob Park, Pengliang Ji, Hanbin Zhao, Yuanning Li, Alex Wong
arXiv technical report, 2024
2023
WorDepth: Variational Language Prior for Monocular Depth Estimation
Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
CVPR 2024
code
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang*, Chao Feng*, Ziyang Chen*, Hyoungseob Park, Daniel Wang, Yiming Dou,
Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong
CVPR 2024
project page,
code
2022
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
Xiangyang Zhu*, Renrui Zhang*, Bowei He, Ziyu Guo,
Ziyao Zeng, Zipeng Qin, Shanghang Zhang, Peng Gao
ICCV 2023
code