Ziyao (Adonis) Zeng 曾子尧

Ph.D. Student in Computer Science
Yale University

I'm a third-year Ph.D. student in Computer Science (2023 - [Expected] 2027) at Yale University. Before that, I obtained my B.Eng. in Computer Science (2019 - 2023) at ShanghaiTech University, with a minor in Innovation and Entrepreneurship.

I conduct research on Multimodal Learning inspired by human cognition towards Multimodal Agentic AI, especially vision-language models and spatial understanding. My line of work in "Language for 3D Vision" explores how vision-language models can perceive and understand the world like humans do. My expertise lies in vision-language models, large language models, diffusion models, multimodal learning, and 2D/3D computer vision.

Google Scholar | LinkedIn | GitHub | Email

Collaborators

Nvidia Research

Dr. Orazio Gallo, Dr. Hang Su, Dr. Jindong Jiang, Dr. Abhishek Badki, Dr. Sifei Liu

Yale Vision Lab

Prof. Alex Wong, Prof. Dong Lao, Prof. Byung-Woo Hong, Prof. Stefano Soatto

UPenn GRASP Lab

Prof. Jianbo Shi, Dr. Renrui Zhang

ShanghaiTech PLUS Group

Prof. Xuming He, Zhitong Gao

I was also fortunate to intern at Shanghai AI Lab and UISEE during my undergraduate studies.

Reviewer Services

CVPR (2022, 2025 [Outstanding Reviewer], 2026), ICCV (2023, 2025), ECCV (2024), ICML (2025), ICLR (2025, 2026), NeurIPS (2024, 2025), ACM MM (2023, 2025), AISTATS (2024, 2025), ICASSP (2024, 2025), TCSVT (journal)

Research

Multimodal Learning towards Multimodal Agentic AI

Since the age of 13, when I was deeply touched by Isaac Asimov's Foundation, my dream has been to create an AI that can think like humans (just like the dream of Prof. Jürgen Schmidhuber). When humans perceive the surrounding environment, we see (2D vision), hear (audio), feel (tactile), and interact (3D vision) simultaneously to reason (language) and understand (neural signal) the world. Therefore, I conduct research on Multimodal Learning towards Multimodal Agentic AI. My research vision is to empower agentic AI with multimodal sensing and multimodal representations, enabling it to perceive, reason, understand, and interact with both the digital and physical worlds like humans do.

Specifically, my line of work in "Language for 3D Vision" (DepthCLIP, PointCLIPv2, WorDepth, RSA, Iris) explores how vision-language models can perceive and understand the world like humans do.

Publications

Selected publications are highlighted.

(* indicates equal contributions)

2026

RuleSmith: Multi-Agent LLMs for Automated Game Balancing

Ziyao Zeng, Chen Liu, Tianyu Liu, Hao Wang, Xiatao Sun, Fengyu Yang, Xiaofeng Liu, Zhiwen Fan

arXiv technical report, 2026 | project page, code

2025

Coffee: Controllable Diffusion Fine-tuning

Ziyao Zeng, Jingcheng Ni, Ruyi Liu, Alex Wong

arXiv technical report, 2025

ETA: Energy-based Test-time Adaptation for Depth Completion

Younjoon Chung*, Hyoungseob Park*, Patrick Rim*, Xiaoran Zhang, Jihe He, Ziyao Zeng, Safa Cicek, Byung-Woo Hong, James S. Duncan, Alex Wong

ICCV 2025

ProtoDepth: Unsupervised Continual Depth Completion with Prototypes

Patrick Rim, Hyoungseob Park, S. Gangopadhyay, Ziyao Zeng, Younjoon Chung, Alex Wong

CVPR 2025 | project page, code

HOMER: Homography-Based Efficient Multi-view 3D Object Removal

Jingcheng Ni*, Weiguang Zhao*, Daniel Wang, Ziyao Zeng, Chenyu You, Alex Wong, Kaizhu Huang

arXiv technical report, 2025

2024

Iris: Integrating Language into Diffusion-based Monocular Depth Estimation

Ziyao Zeng*, Jingcheng Ni*, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong

arXiv technical report, 2024 | NECV 2025 Oral Presentation (18.75%)

RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

Ziyao Zeng, Yangchao Wu, Hyoungseob Park, Daniel Wang, Fengyu Yang, Stefano Soatto, Dong Lao, Byung-Woo Hong, Alex Wong

NeurIPS 2024 | code

NeuroBind: Towards Unified Multimodal Representations for Neural Signals

Fengyu Yang*, Chao Feng*, Daniel Wang*, Tianye Wang, Ziyao Zeng, Zhiyang Xu, Hyoungseob Park, Pengliang Ji, Hanbin Zhao, Yuanning Li, Alex Wong

arXiv technical report, 2024

2023

WorDepth: Variational Language Prior for Monocular Depth Estimation

Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong

CVPR 2024 | code

Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Fengyu Yang*, Chao Feng*, Ziyang Chen*, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong

CVPR 2024 | project page, code