More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Huazhong University of Science & Technology

Abstract

Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation that starts from a fixed pre-trained text-to-image model. MERGE demonstrates that a pre-trained text-to-image model can do more than generate images: it can also be extended to depth estimation with minimal effort. Specifically, MERGE introduces a plug-and-play framework that enables seamless switching between image generation and depth estimation modes through simple, pluggable converters. Meanwhile, we propose a Group Reuse Mechanism that encourages parameter reuse and improves the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks.

Overview


In this work, to overcome the catastrophic degradation in image generation capability suffered by generative depth estimation methods, we present a simple and elegant plug-and-play framework. Without relying on large-scale data-driven training or complex architectural designs, our method extends the pre-trained text-to-image model with depth estimation capability by incorporating two core components, pluggable converters and the Group Reuse Mechanism, while preserving its original image generation ability.
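To make the idea concrete, below is a minimal, self-contained sketch (not the official MERGE implementation). It assumes low-rank converter modules and a simple block-to-group assignment; the module names, shapes, and grouping scheme are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: pluggable converters switch a frozen text-to-image
# backbone between image-generation and depth-estimation modes, and several
# backbone blocks share one converter (group reuse) to limit extra parameters.
import torch
import torch.nn as nn


class Converter(nn.Module):
    """A small pluggable adapter applied after a frozen backbone block (assumed low-rank form)."""

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity residual (no effect at init)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))


class UnifiedModel(nn.Module):
    """Frozen backbone plus converters that can be plugged in (depth) or bypassed (generation)."""

    def __init__(self, backbone_blocks: nn.ModuleList, dim: int, num_groups: int = 4):
        super().__init__()
        self.blocks = backbone_blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)  # the pre-trained backbone stays fixed
        # Group reuse: len(blocks) blocks share only num_groups converters.
        self.converters = nn.ModuleList(Converter(dim) for _ in range(num_groups))
        self.group_of = [i * num_groups // len(self.blocks) for i in range(len(self.blocks))]
        self.mode = "generation"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            x = block(x)
            if self.mode == "depth":
                # Converters are only active in depth mode; in generation mode
                # the frozen backbone runs unchanged, preserving image synthesis.
                x = self.converters[self.group_of[i]](x)
        return x


if __name__ == "__main__":
    dim = 64
    blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(8))  # stand-in for diffusion blocks
    model = UnifiedModel(blocks, dim)
    x = torch.randn(2, dim)
    model.mode = "generation"
    y_gen = model(x)
    model.mode = "depth"
    y_depth = model(x)
    print(y_gen.shape, y_depth.shape)
```

In this toy setup only the converter weights would receive gradients during depth training, so removing or bypassing them restores the original generation behavior exactly.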

Pipeline


Experiments

Zero-shot Affine-invariant Depth Estimation


Zero-shot Surface Normal Estimation


BibTeX


    @inproceedings{lin2025merge,
          title={More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models},
          author={Lin, Hongkai and Liang, Dingkang and Du, Mingyang and Zhou, Xin and Bai, Xiang},
          booktitle={Advances in Neural Information Processing Systems},
          year={2025},
    }