¹Yale University · ²Brown University · ³Chung-Ang University
* Equal contribution
Integrating language into diffusion models enhances monocular depth estimation by providing an additional condition associated with plausible 3D scenes, thus reducing the solution space. The conditional distribution is learned during text-to-image pre-training and associated with depth during fine-tuning with image-text-depth pairs.
We train the diffusion model to predict the noise in a noisy depth latent, conditioned on the input image and a language description. At inference, the model denoises from Gaussian noise to a depth latent, which a frozen VAE decoder then maps to the depth map.
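The training objective above can be sketched as follows. This is a toy NumPy illustration under our own assumptions, not the paper's code: `toy_denoiser`, the noise schedule, and all tensor shapes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(z0, eps, alpha_bar_t):
    # Forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def toy_denoiser(z_t, image_latent, text_emb, t):
    # Placeholder for the noise predictor. The real model is a latent-diffusion
    # U-Net conditioned on the image latent and on the text embedding
    # (typically via cross-attention); this stub just predicts zero noise.
    return np.zeros_like(z_t)

def training_loss(depth_latent, image_latent, text_emb, alpha_bars):
    # One training step: sample a timestep, noise the depth latent,
    # predict the noise, and take an MSE loss against the true noise.
    t = int(rng.integers(len(alpha_bars)))
    eps = rng.standard_normal(depth_latent.shape)
    z_t = add_noise(depth_latent, eps, alpha_bars[t])
    eps_hat = toy_denoiser(z_t, image_latent, text_emb, t)
    return float(np.mean((eps - eps_hat) ** 2))
```

In the actual pipeline, the depth latent comes from a frozen VAE encoder applied to the ground-truth depth, and the text embedding from the text encoder of the underlying text-to-image model.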
Integrating language consistently improves Marigold, Lotus-D, Lotus-G, and E2E-FT across NYUv2, KITTI, ETH3D, ScanNet, and DIODE. Below: selected metrics, reported ×100 (δ₁: higher is better; AbsRel: lower is better). Our “Train & Infer” text variants are highlighted.
| Method | NYUv2 δ₁↑ | NYUv2 AbsRel↓ | KITTI δ₁↑ | KITTI AbsRel↓ | ETH3D δ₁↑ | ScanNet δ₁↑ | DIODE δ₁↑ |
|---|---|---|---|---|---|---|---|
| Marigold | 95.9 | 6.0 | 90.4 | 10.5 | 95.1 | 94.5 | 77.2 |
| Marigold + Text (Train & Infer) | 95.9 | 5.9 | 90.6 | 10.4 | 95.7 | 94.9 | 78.9 |
| Lotus-D* | 96.6 | 5.6 | 92.2 | 8.7 | 96.8 | 96.0 | 74.1 |
| Lotus-D + Text (Train & Infer) | 96.8 | 5.4 | 93.0 | 8.4 | 97.0 | 96.6 | 74.2 |
| Lotus-G* | 95.2 | 6.7 | 92.2 | 8.9 | 95.7 | 93.7 | 71.7 |
| Lotus-G + Text (Train & Infer) | 96.3 | 5.9 | 92.8 | 8.6 | 96.3 | 95.3 | 72.5 |
| E2E-FT* | 95.4 | 6.9 | 90.1 | 10.5 | 94.1 | 94.6 | 76.4 |
| E2E-FT + Text (Train & Infer) | 96.3 | 6.2 | 91.7 | 9.7 | 94.7 | 95.0 | 77.0 |
Language improves depth for specified or ambiguous regions (e.g. soap dispenser, lamps, kitchen in background, parked car, distant signs). Iterative refinement with more detailed text further improves specified regions.
- NYUv2: better depth for objects mentioned in the text (e.g. soap dispenser, lamps).
- KITTI: improved depth for described objects (e.g. parked car, distant sign).
- Iterative depth refinement guided by progressively more detailed language.
- Improved depth perception of regions named in the text (e.g. the kitchen in the background).
Integrating language also accelerates training convergence and reduces the number of denoising steps needed at inference (e.g. ~10 vs. ~25 steps).
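The few-step inference can be sketched as a short deterministic (DDIM-style) sampling loop. Again a hedged NumPy toy: `sample_depth_latent`, the evenly spaced step schedule, and the injected `denoiser` stand-in are our illustrative assumptions, not the released implementation.

```python
import numpy as np

def ddim_step(z_t, eps_hat, ab_t, ab_prev):
    # Deterministic DDIM update: estimate z0 from the current latent and
    # predicted noise, then re-noise it to the previous noise level.
    z0_hat = (z_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * z0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

def sample_depth_latent(denoiser, image_latent, text_emb, alpha_bars, steps=10):
    # Start from pure Gaussian noise and denoise over a short step schedule,
    # conditioning each prediction on the image latent and text embedding.
    ts = np.linspace(len(alpha_bars) - 1, 0, steps).astype(int)
    z = np.random.default_rng(0).standard_normal(image_latent.shape)
    for i, t in enumerate(ts):
        eps_hat = denoiser(z, image_latent, text_emb, t)
        ab_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else 1.0
        z = ddim_step(z, eps_hat, alpha_bars[t], ab_prev)
    return z  # decode with the frozen VAE decoder to obtain the depth map
```

A deterministic sampler lets the step count be cut (here 10) without re-training; the claim above is that language conditioning makes the short schedule sufficient.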
```bibtex
@article{zeng2024iris,
  title={Iris: Integrating Language into Diffusion-based Monocular Depth Estimation},
  author={Zeng, Ziyao and Ni, Jingcheng and Wang, Daniel and Rim, Patrick and Chung, Younjoon and Yang, Fengyu and Hong, Byung-Woo and Wong, Alex},
  journal={arXiv preprint arXiv:2411.16750},
  year={2024}
}
```