I'm a second-year Ph.D. student in Computer Science (2023 - [Expected] 2027) at Yale University, supervised by Prof. Alex Wong. Prior to that, I obtained my B.Eng. in Computer Science (2019 - 2023) at ShanghaiTech University, with a minor in Innovation and Entrepreneurship.
I am an incoming PhD Research Intern at NVIDIA Research, mentored by Orazio Gallo. Previously, I interned with Prof. Jianbo Shi at the UPenn GRASP Lab and with Prof. Xuming He at the ShanghaiTech PLUS Group.
I conduct research on Computer Vision, Machine Learning, and Robotics, focusing on Multimodal Embodied AI inspired by human learning. Currently, my research centers on Vision-Language Models for 3D Vision (Perception and Reconstruction).
Google Scholar /
GitHub /
Yale Vision Lab
Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML
Email: ziyao.zeng (at) yale.edu
Website template from Xingyi Zhou.
Last updated April 2025
Research Overview
Since the age of 13, when I was deeply moved by Isaac Asimov's Foundation, my dream has been to create an AI that can think like humans (much like the dream of Prof. Jürgen Schmidhuber). When humans perceive the surrounding environment, we see (2D vision), touch (tactile), wander (3D vision), and hear (audio) simultaneously to understand (neural signals) and interpret (language). Therefore, I conduct research on Multimodal Embodied AI. My research vision is to empower embodied AI with multimodal sensing and the ability to leverage pre-trained multimodal representations, so that it can interact with the physical world as humans do.
Specifically, I conduct research on Language for 3D Vision (Perception and Reconstruction). Given a language description, one can easily imagine what the scene might look like, so language can serve as a condition for generating and manipulating 3D scenes in a controllable manner. Conversely, a language description can act as a scene-specific prior that enhances 3D reconstruction by resolving scale ambiguity. Language can also express ordinal relationships between objects, from which their relative depth can be inferred. Because language descriptions are invariant to nuisance variability (e.g., illumination, occlusion, and viewpoint in images), they provide robust additional features that help models generalize. Practically, language is also arguably cheaper to obtain than range measurements (e.g., from lidar or radar).
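As a rough, hedged sketch of the scale-ambiguity idea above (a conceptual illustration, not any paper's actual implementation): a text embedding, e.g. from a CLIP text encoder, could be mapped to a single positive scale that converts an up-to-scale depth prediction into metric depth. All module names and tensor shapes below are placeholder assumptions.

```python
# Conceptual sketch only: predict a global scale for a relative depth map
# from a language embedding. The text encoder and depth network are assumed
# to exist upstream (e.g., a CLIP text encoder and any monocular depth model).
import torch
import torch.nn as nn

class ScaleFromText(nn.Module):
    def __init__(self, text_dim: int = 512):
        super().__init__()
        # Small MLP mapping a text embedding to one scalar per image.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, text_embedding: torch.Tensor, relative_depth: torch.Tensor):
        # Softplus keeps the predicted scale strictly positive.
        scale = torch.nn.functional.softplus(self.mlp(text_embedding))
        # Broadcast the per-image scale over the H x W depth map.
        return scale.view(-1, 1, 1) * relative_depth

# Placeholder tensors standing in for CLIP text embeddings and
# up-to-scale depth predictions for a batch of two images.
text_emb = torch.randn(2, 512)
rel_depth = torch.rand(2, 192, 256)
metric_depth = ScaleFromText()(text_emb, rel_depth)
print(metric_depth.shape)  # torch.Size([2, 192, 256])
```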
Publications
Selected publications on "Language for 3D Vision" are highlighted.
(* indicates equal contributions)
2025
2024
PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
arXiv technical report, 2024
RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions
Ziyao Zeng, Yangchao Wu, Hyoungseob Park, Daniel Wang, Fengyu Yang, Stefano Soatto, Dong Lao, Byung-Woo Hong, Alex Wong
NeurIPS 2024
NeuroBind: Towards Unified Multimodal Representations for Neural Signals
Fengyu Yang*, Chao Feng*, Daniel Wang*, Tianye Wang,
Ziyao Zeng, Zhiyang Xu, Hyoungseob Park, Pengliang Ji, Hanbin Zhao, Yuanning Li, Alex Wong
arXiv technical report, 2024
2023
WorDepth: Variational Language Prior for Monocular Depth Estimation
Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
CVPR 2024
code
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang*, Chao Feng*, Ziyang Chen*, Hyoungseob Park, Daniel Wang, Yiming Dou,
Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong
CVPR 2024
project page,
code
2022
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
Xiangyang Zhu*, Renrui Zhang*, Bowei He, Ziyu Guo,
Ziyao Zeng, Zipeng Qin, Shanghang Zhang, Peng Gao
ICCV 2023
code