Iris: Integrating Language into Diffusion-based Monocular Depth Estimation

Accepted to CVPR 2026

Ziyao Zeng*1, Jingcheng Ni*2, Daniel Wang1, Patrick Rim1, Younjoon Chung1, Fengyu Yang1, Byung-Woo Hong3, Alex Wong1

1 Yale University   2 Brown University   3 Chung-Ang University
* Equal contribution

Abstract

Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition, beyond the image alone, that is aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models: to generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model must implicitly capture object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, we present Iris, a strategy that integrates text descriptions into the training and inference of diffusion-based depth estimation models, and investigate its benefits. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves overall monocular depth estimation accuracy, especially in small regions. It also improves the model's depth perception of specific regions described in the text. We find that providing more details in the text allows the depth prediction to be iteratively refined. Simultaneously, we find that language acts as a constraint that accelerates convergence of both training and the inference-time denoising trajectory. Code and generated text data will be released upon acceptance.

Teaser

Iris teaser: language conditions reduce depth solution space

Integrating language into diffusion models enhances monocular depth estimation by providing an additional condition associated with plausible 3D scenes, thus reducing the solution space. The conditional distribution is learned during text-to-image pre-training and associated with depth during fine-tuning with image-text-depth pairs.

Key Findings

- Language conditioning consistently improves depth accuracy across Marigold, Lotus, and E2E-FT, especially in small regions.
- Depth perception improves for regions explicitly described in the text, and more detailed descriptions iteratively refine the prediction.
- Language acts as a constraint that speeds up training convergence and reduces the number of denoising steps needed at inference.

Pipeline

Pipeline: text-conditioned diffusion for depth

We train the diffusion model to predict noise in the noisy depth latent conditioned on the input image and language description. At inference, the model denoises from Gaussian noise to a depth latent, then decodes to the depth map via a frozen VAE decoder.
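The train/infer loop above can be sketched in a few lines. The following is a minimal NumPy toy, not the paper's actual implementation: the latent dimension, the linear noise schedule, and the `oracle` denoiser are all illustrative assumptions (Iris uses a latent diffusion model with frozen VAE and text encoder, and a learned U-Net predicts the noise from the noisy depth latent, image latent, and text embedding).

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # toy stand-in for the VAE latent size (assumption)
T = 50           # number of diffusion timesteps (assumption)

# Simple linear noise schedule; alpha_bar decays from ~1 toward 0.
betas = np.linspace(1e-4, 0.2, T)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(depth_latent, eps, t):
    """Forward process: noisy latent z_t from clean depth latent z_0.
    During training, the model is supervised to predict eps from z_t,
    conditioned on the image latent and text embedding."""
    return np.sqrt(alpha_bar[t]) * depth_latent + np.sqrt(1 - alpha_bar[t]) * eps

def denoise(model, image_emb, text_emb, steps=T):
    """Reverse process: start from Gaussian noise and iteratively remove
    the predicted noise (DDIM-style deterministic update)."""
    z = rng.standard_normal(LATENT_DIM)
    for t in reversed(range(steps)):
        eps_hat = model(z, image_emb, text_emb, t)
        z0_hat = (z - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        if t > 0:
            z = np.sqrt(alpha_bar[t - 1]) * z0_hat + np.sqrt(1 - alpha_bar[t - 1]) * eps_hat
        else:
            z = z0_hat
    return z  # a real pipeline would decode z with the frozen VAE decoder

# Sanity check with a perfect "oracle" denoiser: a real model would instead
# be a network conditioned on image_emb and text_emb.
depth_latent = rng.standard_normal(LATENT_DIM)
image_emb = rng.standard_normal(LATENT_DIM)
text_emb = rng.standard_normal(LATENT_DIM)

def oracle(z, image_emb, text_emb, t):
    # Returns the exact noise separating z from the true clean latent.
    return (z - np.sqrt(alpha_bar[t]) * depth_latent) / np.sqrt(1 - alpha_bar[t])

recovered = denoise(oracle, image_emb, text_emb)
print(np.allclose(recovered, depth_latent))  # → True
```

With an exact noise predictor the deterministic reverse process recovers the clean depth latent; in practice the learned predictor only approximates it, which is where the extra text condition helps narrow the solution space.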

Results

Integrating language consistently improves Marigold, Lotus-D, Lotus-G, and E2E-FT across NYUv2, KITTI, ETH3D, ScanNet, and DIODE. Below: selected metrics (δ₁↑, AbsRel↓). Our “Train & Infer” text variants are highlighted.

| Method | NYUv2 δ₁↑ | NYUv2 AbsRel↓ | KITTI δ₁↑ | KITTI AbsRel↓ | ETH3D δ₁↑ | ScanNet δ₁↑ | DIODE δ₁↑ |
|---|---|---|---|---|---|---|---|
| Marigold | 95.9 | 6.0 | 90.4 | 10.5 | 95.1 | 94.5 | 77.2 |
| **Marigold + Text (Train & Infer)** | 95.9 | 5.9 | 90.6 | 10.4 | 95.7 | 94.9 | 78.9 |
| Lotus-D* | 96.6 | 5.6 | 92.2 | 8.7 | 96.8 | 96.0 | 74.1 |
| **Lotus-D + Text (Train & Infer)** | 96.8 | 5.4 | 93.0 | 8.4 | 97.0 | 96.6 | 74.2 |
| Lotus-G* | 95.2 | 6.7 | 92.2 | 8.9 | 95.7 | 93.7 | 71.7 |
| **Lotus-G + Text (Train & Infer)** | 96.3 | 5.9 | 92.8 | 8.6 | 96.3 | 95.3 | 72.5 |
| E2E-FT* | 95.4 | 6.9 | 90.1 | 10.5 | 94.1 | 94.6 | 76.4 |
| **E2E-FT + Text (Train & Infer)** | 96.3 | 6.2 | 91.7 | 9.7 | 94.7 | 95.0 | 77.0 |

Qualitative Results

Language improves depth estimates for specified or ambiguous regions (e.g., a soap dispenser, lamps, a kitchen in the background, a parked car, distant signs). Iterative refinement with more detailed text further improves the specified regions.

NYUv2 visualizations

NYUv2: better depth for objects mentioned in the text (e.g. soap dispenser, lamps).

KITTI visualizations

KITTI: improved depth for described objects (e.g. parked car, distant sign).

Iterative refinement with language

Iterative depth refinement with more detailed language.

Language improves depth for specified regions

Language improves depth perception of specified regions (e.g. kitchen in background).

Convergence & Efficiency

Training convergence

Integrating language accelerates training convergence.

Fewer denoising steps

Fewer denoising steps needed at inference with language (e.g. ~10 vs ~25 steps).

BibTeX

@article{zeng2024iris,
  title={Iris: Integrating Language into Diffusion-based Monocular Depth Estimation},
  author={Zeng, Ziyao and Ni, Jingcheng and Wang, Daniel and Rim, Patrick and Chung, Younjoon and Yang, Fengyu and Hong, Byung-Woo and Wong, Alex},
  journal={arXiv preprint arXiv:2411.16750},
  year={2024}
}