More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Huazhong University of Science & Technology

Abstract

Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation that starts from a fixed pre-trained text-to-image model. MERGE demonstrates that a pre-trained text-to-image model can do more than generate images: it can also be extended to depth estimation with minimal effort. Specifically, MERGE introduces a plug-and-play framework that enables seamless switching between image generation and depth estimation modes through simple, pluggable converters. Meanwhile, we propose a Group Reuse Mechanism that encourages parameter reuse and improves the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks.

Overview


In this work, to overcome the catastrophic degradation in image generation capability suffered by generative depth estimation methods, we present a simple and elegant plug-and-play framework. Without relying on large-scale data-driven training or complex architectural designs, our method extends the pre-trained text-to-image model with depth estimation capability by incorporating two core components, pluggable converters and the Group Reuse Mechanism, while preserving its original image generation ability.
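To make the idea concrete, below is a minimal, self-contained sketch (not the official MERGE implementation). It assumes low-rank converter modules and a simple block-to-group assignment; the module names, shapes, and grouping scheme are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: pluggable converters switch a frozen text-to-image
# backbone between image-generation and depth-estimation modes, and several
# backbone blocks share one converter (group reuse) to limit extra parameters.
import torch
import torch.nn as nn


class Converter(nn.Module):
    """A small pluggable adapter applied after a frozen backbone block (assumed low-rank form)."""

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity residual (no effect at init)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))


class UnifiedModel(nn.Module):
    """Frozen backbone plus converters that can be plugged in (depth) or bypassed (generation)."""

    def __init__(self, backbone_blocks: nn.ModuleList, dim: int, num_groups: int = 4):
        super().__init__()
        self.blocks = backbone_blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)  # the pre-trained backbone stays fixed
        # Group reuse: len(blocks) blocks share only num_groups converters.
        self.converters = nn.ModuleList(Converter(dim) for _ in range(num_groups))
        self.group_of = [i * num_groups // len(self.blocks) for i in range(len(self.blocks))]
        self.mode = "generation"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            x = block(x)
            if self.mode == "depth":
                # Converters are only active in depth mode; in generation mode
                # the frozen backbone runs unchanged, preserving image synthesis.
                x = self.converters[self.group_of[i]](x)
        return x


if __name__ == "__main__":
    dim = 64
    blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(8))  # stand-in for diffusion blocks
    model = UnifiedModel(blocks, dim)
    x = torch.randn(2, dim)
    model.mode = "generation"
    y_gen = model(x)
    model.mode = "depth"
    y_depth = model(x)
    print(y_gen.shape, y_depth.shape)
```

In this toy setup only the converter weights would receive gradients during depth training, so removing or bypassing them restores the original generation behavior exactly.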

Pipeline


Experiments

Zero-shot Affine-invariant Depth Estimation


Zero-shot Surface Normal Estimation


BibTeX


    @inproceedings{lin2025merge,
          title={More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models},
          author={Lin, Hongkai and Liang, Dingkang and Du, Mingyang and Zhou, Xin and Bai, Xiang},
          booktitle={Advances in Neural Information Processing Systems},
          year={2025},
    }