Iris: Integrating Language into Diffusion-based Monocular Depth Estimation

Accepted to CVPR 2026

Ziyao Zeng*1, Jingcheng Ni*2, Daniel Wang1, Patrick Rim1, Younjoon Chung1, Fengyu Yang1, Byung-Woo Hong3, Alex Wong1

1 Yale University   2 Brown University   3 Chung-Ang University
* Equal contribution

Abstract

Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition, beyond the image alone, that is aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models: to generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model must implicitly capture object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, we present Iris, a strategy that integrates text descriptions into the training and inference of diffusion-based depth estimation models, and investigate its benefits. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves overall monocular depth estimation accuracy, especially in small regions. It also improves the model's depth perception of specific regions described in the text. We find that providing more details in the text allows the depth prediction to be iteratively refined. Simultaneously, we find that language acts as a constraint that accelerates convergence of both training and the inference-time denoising trajectory. Code and generated text data will be released upon acceptance.

Teaser

Iris teaser: language conditions reduce depth solution space

Integrating language into diffusion models enhances monocular depth estimation by providing an additional condition associated with plausible 3D scenes, thus reducing the solution space. The conditional distribution is learned during text-to-image pre-training and associated with depth during fine-tuning with image-text-depth pairs.

Key Findings

- Language conditioning consistently improves depth accuracy across Marigold, Lotus, and E2E-FT, especially in small regions.
- Depth perception improves for regions explicitly described in the text, and more detailed descriptions iteratively refine the prediction.
- Language acts as a constraint that speeds up training convergence and reduces the number of denoising steps needed at inference.

Pipeline

Pipeline: text-conditioned diffusion for depth

We train the diffusion model to predict noise in the noisy depth latent conditioned on the input image and language description. At inference, the model denoises from Gaussian noise to a depth latent, then decodes to the depth map via a frozen VAE decoder.
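The train/infer loop above can be sketched in a few lines. The following is a minimal NumPy toy, not the paper's actual implementation: the latent dimension, the linear noise schedule, and the `oracle` denoiser are all illustrative assumptions (Iris uses a latent diffusion model with frozen VAE and text encoder, and a learned U-Net predicts the noise from the noisy depth latent, image latent, and text embedding).

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # toy stand-in for the VAE latent size (assumption)
T = 50           # number of diffusion timesteps (assumption)

# Simple linear noise schedule; alpha_bar decays from ~1 toward 0.
betas = np.linspace(1e-4, 0.2, T)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(depth_latent, eps, t):
    """Forward process: noisy latent z_t from clean depth latent z_0.
    During training, the model is supervised to predict eps from z_t,
    conditioned on the image latent and text embedding."""
    return np.sqrt(alpha_bar[t]) * depth_latent + np.sqrt(1 - alpha_bar[t]) * eps

def denoise(model, image_emb, text_emb, steps=T):
    """Reverse process: start from Gaussian noise and iteratively remove
    the predicted noise (DDIM-style deterministic update)."""
    z = rng.standard_normal(LATENT_DIM)
    for t in reversed(range(steps)):
        eps_hat = model(z, image_emb, text_emb, t)
        z0_hat = (z - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        if t > 0:
            z = np.sqrt(alpha_bar[t - 1]) * z0_hat + np.sqrt(1 - alpha_bar[t - 1]) * eps_hat
        else:
            z = z0_hat
    return z  # a real pipeline would decode z with the frozen VAE decoder

# Sanity check with a perfect "oracle" denoiser: a real model would instead
# be a network conditioned on image_emb and text_emb.
depth_latent = rng.standard_normal(LATENT_DIM)
image_emb = rng.standard_normal(LATENT_DIM)
text_emb = rng.standard_normal(LATENT_DIM)

def oracle(z, image_emb, text_emb, t):
    # Returns the exact noise separating z from the true clean latent.
    return (z - np.sqrt(alpha_bar[t]) * depth_latent) / np.sqrt(1 - alpha_bar[t])

recovered = denoise(oracle, image_emb, text_emb)
print(np.allclose(recovered, depth_latent))  # → True
```

With an exact noise predictor the deterministic reverse process recovers the clean depth latent; in practice the learned predictor only approximates it, which is where the extra text condition helps narrow the solution space.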

Results

Integrating language consistently improves Marigold, Lotus-D, Lotus-G, and E2E-FT across NYUv2, KITTI, ETH3D, ScanNet, and DIODE. Below: selected metrics (δ₁↑, AbsRel↓). Our “Train & Infer” text variants are highlighted.

| Method | NYUv2 δ₁↑ | NYUv2 AbsRel↓ | KITTI δ₁↑ | KITTI AbsRel↓ | ETH3D δ₁↑ | ScanNet δ₁↑ | DIODE δ₁↑ |
|---|---|---|---|---|---|---|---|
| Marigold | 95.9 | 6.0 | 90.4 | 10.5 | 95.1 | 94.5 | 77.2 |
| **Marigold + Text (Train & Infer)** | 95.9 | 5.9 | 90.6 | 10.4 | 95.7 | 94.9 | 78.9 |
| Lotus-D* | 96.6 | 5.6 | 92.2 | 8.7 | 96.8 | 96.0 | 74.1 |
| **Lotus-D + Text (Train & Infer)** | 96.8 | 5.4 | 93.0 | 8.4 | 97.0 | 96.6 | 74.2 |
| Lotus-G* | 95.2 | 6.7 | 92.2 | 8.9 | 95.7 | 93.7 | 71.7 |
| **Lotus-G + Text (Train & Infer)** | 96.3 | 5.9 | 92.8 | 8.6 | 96.3 | 95.3 | 72.5 |
| E2E-FT* | 95.4 | 6.9 | 90.1 | 10.5 | 94.1 | 94.6 | 76.4 |
| **E2E-FT + Text (Train & Infer)** | 96.3 | 6.2 | 91.7 | 9.7 | 94.7 | 95.0 | 77.0 |

Qualitative Results

Language improves depth estimates for specified or ambiguous regions (e.g., a soap dispenser, lamps, a kitchen in the background, a parked car, distant signs). Iterative refinement with more detailed text further improves the specified regions.

NYUv2 visualizations

NYUv2: better depth for objects mentioned in the text (e.g. soap dispenser, lamps).

KITTI visualizations

KITTI: improved depth for described objects (e.g. parked car, distant sign).

Iterative refinement with language

Iterative depth refinement with more detailed language.

Language improves depth for specified regions

Language improves depth perception of specified regions (e.g. kitchen in background).

Convergence & Efficiency

Training convergence

Integrating language accelerates training convergence.

Fewer denoising steps

Fewer denoising steps needed at inference with language (e.g. ~10 vs ~25 steps).

BibTeX

@article{zeng2024iris,
  title={Iris: Integrating Language into Diffusion-based Monocular Depth Estimation},
  author={Zeng, Ziyao and Ni, Jingcheng and Wang, Daniel and Rim, Patrick and Chung, Younjoon and Yang, Fengyu and Hong, Byung-Woo and Wong, Alex},
  journal={arXiv preprint arXiv:2411.16750},
  year={2024}
}