Extend3D: Town-scale 3D Generation

CVPR 2026
Seoul National University

Extend3D is a training-free pipeline to generate a large-scale 3D scene from a single image.

Abstract

In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the fixed-size latent space of object-centric models, which cannot represent wide scenes, we extend the latent space in the x and y directions. We then divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at every denoising timestep. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene with a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during refinement enables 3D completion, a process we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising so that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives that improve geometric structure and texture fidelity. We demonstrate that our method outperforms prior methods, as evidenced by human preference studies and quantitative experiments.
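The patch coupling described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a MultiDiffusion-style scheme in which each overlapping patch is denoised independently by the fixed-size model and the overlapping predictions are averaged, so that all patches agree at every timestep. The function `denoise_patch` is a hypothetical stand-in for one reverse-diffusion step of the object-centric model, and the latent is treated as a plain NumPy array.

```python
import numpy as np

def couple_patch_denoising(latent, denoise_patch, patch=64, stride=32):
    """One coupled denoising step over an extended latent (illustrative sketch).

    latent: (H, W, C) extended latent; H and W are assumed to tile exactly
    with the given patch size and stride.
    denoise_patch: hypothetical stand-in for one reverse-diffusion step of
    the fixed-size object-centric model on a (patch, patch, C) slice.
    Overlapping patch predictions are accumulated and averaged so that the
    patches stay consistent with each other at this timestep.
    """
    H, W, _ = latent.shape
    acc = np.zeros_like(latent)          # accumulated patch predictions
    weight = np.zeros((H, W, 1))         # how many patches cover each cell
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            acc[y:y+patch, x:x+patch] += denoise_patch(latent[y:y+patch, x:x+patch])
            weight[y:y+patch, x:x+patch] += 1.0
    return acc / weight                  # average over overlapping patches
```

Running this once per timestep, instead of denoising each patch to completion in isolation, is what keeps seams between neighboring patches from diverging.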

Town-scale Scene Generation

Our pipeline can generate town-scale 3D scenes, as shown in the example above. These scenes are 6×6 times larger than those of the object-centric baseline (Trellis). Even larger scenes can be generated by further extending the latent space.



Comparison to Previous Methods

Our method generates diverse scenes in both indoor and outdoor environments. Compared to previous methods, it produces more consistent and higher-quality scenes. Human preference studies and quantitative evaluations in the paper support this claim.



3D from a Text Prompt

We compared our method with SynCity, a state-of-the-art text-to-3D scene generation pipeline. For a fair comparison, we used the prompts from the official SynCity repository. Since our method is image-conditioned, we first generated an image from each text prompt and then generated a 3D scene from the image. Compared to SynCity, our results are less grid-like and more faithful to the prompts.



Method Overview

Method overview figure.

BibTeX

@inproceedings{yoon2026extend3d,
  author    = {Yoon, Seungwoo and Kim, Jinmo and Park, Jaesik},
  title     = {Extend3D: Town-scale 3D Generation},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  year      = {2026},
}