I'm a first-year Ph.D. student in Computer Science (2023 - [Expected] 2028) at Yale University, advised by Prof. Alex Wong. Prior to that, I obtained my B.Eng. in Computer Science (2019 - 2023) at ShanghaiTech University.
Previously, I interned with Prof. Jianbo Shi at the UPenn GRASP Lab and with Prof. Xuming He at the ShanghaiTech PLUS Group.
I conduct research on Multimodal Learning inspired by human learning. Currently, my research mainly lies in Language for 3D Vision (Perception, Manipulation, and Generation).
Google Scholar /
GitHub /
Yale Vision Lab
Reviewer: CVPR 2022, ICCV 2023, ACM MM 2023, ICASSP 2024, ECCV 2024, ACCV 2024
Email: ziyao.zeng@yale.edu
Website format from Xingyi Zhou.
Last updated Sept. 2023
Research Overview
Since the age of 13, when I was deeply moved by Isaac Asimov's Foundation, my dream has been to explore the galaxy. Unable to fulfill this mission with existing technology, I reoriented my goal toward creating an AI that can think like humans (echoing the dream of Prof. Jürgen Schmidhuber), what we today call Artificial General Intelligence (AGI), and exploring the galaxy together. When humans perceive the surrounding environment, we see (2D vision), touch (tactile), sense (3D vision), and hear (audio) simultaneously in order to understand (language). Therefore, AGI's understanding should likewise be built on the complementary learning of different modalities.
Recently, human-like AI perception has come closer to reality. Large-scale foundation models pre-trained on multimodal data provide promising unified frameworks for Multimodal Learning. In particular, Contrastive Language-Image Pre-training (CLIP) jointly trains an image encoder and a text encoder with a contrastive objective in feature space. This resembles parents pointing at objects to help children recognize them: through this simple interaction, children learn not only to classify objects but also to segment them, without any pixel-wise mask or box annotations.
DepthCLIP, a project I led that builds on CLIP, reveals that humans learn to predict depth not from pixel-wise depth annotations but from relative depth semantics. Children are taught "this tree is far, and that bus is close." Consequently, they build a semantic depth understanding of monocular images, indicating which objects are near and which are far. We found that CLIP, having learned a mutual understanding of semantic language-image concepts, has the same human-like ability to distinguish relative depth, and can perform zero-shot, training-free monocular depth estimation. WorDepth, on the other hand, improves monocular depth estimation models by incorporating language guidance into training, since language can provide the depth model with geometric priors that are associated with semantics but not explicitly addressed in depth estimation datasets.
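The core mechanism can be sketched in a few lines: semantic depth "bins" are described in language, an image-patch feature is matched against the text features, and a softmax-weighted average of the bin depths yields a continuous depth estimate. In the sketch below, random vectors stand in for real CLIP embeddings, and the prompt wording, temperature, and bin depth values are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

# Sketch of the DepthCLIP idea: depth inferred from language by matching an
# image-patch feature against text features for a handful of semantic depth
# "bins". The prompts and bin depths below are illustrative assumptions, and
# random vectors stand in for real CLIP embeddings.
depth_prompts = ["giant", "extremely close", "close", "not in distance",
                 "a little remote", "far", "unseen"]
bin_depths = np.array([1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0])  # illustrative meters

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(len(depth_prompts), 512))  # stand-in for CLIP text features
patch_emb = rng.normal(size=512)                       # stand-in for one patch's image feature

def normalize(x):
    """L2-normalize along the last axis, as CLIP does before matching."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the patch and each depth prompt, then a
# temperature-scaled softmax to get weights over the depth bins.
sims = normalize(text_emb) @ normalize(patch_emb)
logits = sims / 0.1
weights = np.exp(logits - logits.max())
weights /= weights.sum()

# Continuous depth = softmax-weighted combination of the semantic bin depths.
depth = float(weights @ bin_depths)
```

Because the final depth is a convex combination of the bin values, a small set of discrete language prompts still yields a continuous per-patch estimate, which is what makes training-free depth prediction from language plausible.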
Going forward, I conduct research on
Multimodal Learning inspired by human learning, specifically
Language for 3D Vision (Perception, Manipulation, and Generation).
Publications
(* indicates equal contributions)
2023
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang*, Chao Feng*, Ziyang Chen*, Hyoungseob Park, Daniel Wang, Yiming Dou,
Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong
CVPR 2024
Project Page,
Code
2022
2021
Twitter Emotion Classification
Yiteng Xu*,
Ziyao Zeng*, Jirui Shi*, Shaoxun Wu*, Peiyan Gu*
Final Project of CS181 Artificial Intelligence, 2021 Fall, ShanghaiTech University
Code
2020
My Adventure
I am a big fan of adventure, enthusiastic about cycling, hiking, and mountain climbing.
"Being a scientist and being an adventurer have a lot in common: both strive to achieve something that hasn't been achieved before."
In 2015, I hiked across the Lake District of England in one week.
In 2019, I cycled across Tibet for 28 days, from Chengdu to Lhasa, covering 2,135 km.
In 2022, I cycled across Tibet and Xinjiang for a month, from Ürümqi to Lhasa, covering 5,000 km, about 2,000 km of which were at an average altitude of 4,500 m.
In 2023, I hiked in Yubeng Village for 5 days, at altitudes between 3,000 m and 4,300 m.
My hiking video at Ice Lake (3,700 m altitude):
Link
My hiking video at God Lake (4,300 m altitude):
Link
In 2023, I hiked the Tiger Leaping Gorge High Road for 2 days.
In 2023, I cycled more than 350 km around Qinghai Lake over 4 days.
In 2024, I got my diving certificate in the Red Sea.
More photos from my adventures.
Other things about myself
I'm an amateur Unity game developer, previously supervised by Brian Cox; screenshots of my previous work are shown below.
Snow Ranger
Darkside
I'm also an amateur composer, conductor, pianist, trombone player, guitar player, and Chinese folk singer.
I have been reading Tarot since 2014. I am familiar with the Thoth and Flower Shadow decks, and I am dedicated to combining Tarot with modern psychology to serve as a tool for exploring consciousness.
I'm excited about all kinds of volunteer work, especially work related to environmental protection.
I believe it's our instinctive duty to preserve the integrity of the Earth (at least until we can emigrate to other planets).
Currently, I'm volunteering with WWF-China and Greenpeace.