SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

1Technical University of Munich
2Carnegie Mellon University

Scene Generation from Text Input

SceneFactor factors the complex task of text-guided 3D scene generation into forming a coarse semantic structure, followed by refined geometric synthesis. Rather than requiring a learned model to directly decide the location, type, size, and local geometry of scene elements, generating a coarse semantic box layout first reduces training to the simpler task of layout-guided geometric synthesis.

Intuitive Scene Editing

SceneFactor enables seamless localized editing through easy manipulation of the 3D semantic box map. We demonstrate adding objects (adding boxes), moving objects (moving an existing semantic box), resizing objects (scaling an existing semantic box), replacing objects (replacing an existing object box with a new one of a different category), and removing objects (removing an existing semantic box). Note that the rest of the 3D scene remains consistent outside of the editing region.
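To make these edits concrete, the sketch below mimics them on a simplified dense 3D semantic map, where each edit is a direct write to an axis-aligned box region of class labels; in SceneFactor, the geometric synthesis stage would then re-generate geometry around the edited region. The grid resolution, class ids, and helper names (add_box, remove_box, move_box) are illustrative assumptions, not the actual SceneFactor data format or API.

# Hypothetical sketch of semantic box map edits (illustrative, not the SceneFactor format).
import numpy as np

EMPTY = 0  # assumed label for free space

def add_box(sem_map, corner_min, corner_max, class_id):
    """Write a class label into an axis-aligned box region of the 3D semantic map."""
    x0, y0, z0 = corner_min
    x1, y1, z1 = corner_max
    sem_map[x0:x1, y0:y1, z0:z1] = class_id
    return sem_map

def remove_box(sem_map, corner_min, corner_max):
    """Clear a box region back to empty space."""
    return add_box(sem_map, corner_min, corner_max, EMPTY)

def move_box(sem_map, corner_min, corner_max, offset, class_id):
    """Remove a box and re-add it at a translated location."""
    remove_box(sem_map, corner_min, corner_max)
    new_min = tuple(c + o for c, o in zip(corner_min, offset))
    new_max = tuple(c + o for c, o in zip(corner_max, offset))
    return add_box(sem_map, new_min, new_max, class_id)

# Example: a 64^3 semantic map with one object box (class id 3 is an assumption).
sem_map = np.zeros((64, 64, 64), dtype=np.int64)
add_box(sem_map, (10, 10, 0), (20, 18, 8), class_id=3)                # add object
move_box(sem_map, (10, 10, 0), (20, 18, 8), (15, 0, 0), class_id=3)   # move object
# Replacing = remove_box + add_box with a different class; resizing = remove + add with larger extents.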

Abstract

We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes.

While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion instead generates a proxy semantic space composed of semantic 3D boxes, enabling controllable editing of generated scenes by adding, removing, or resizing the semantic 3D proxy boxes, which in turn guides high-fidelity, consistent 3D geometric editing.

Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.

Video

Diffusion-based Factored 3D Scene Generation

We formulate text-guided 3D scene generation as a factored diffusion process, first generating a coarse semantic box layout representing the text input (left), followed by synthesis of scene geometry corresponding to the generated semantics (right). This factorization makes complex 3D scene generation more tractable and enables generation of locally editable 3D scenes, which can be edited through simple box manipulations in the semantic maps.
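As a rough illustration of this factored formulation, the sketch below mirrors the two-stage sampling structure in plain Python: a first denoising loop produces a coarse semantic layout latent conditioned on a text embedding, and a second loop produces a geometric latent conditioned on that layout, which is then decoded to scene geometry. The names (ddpm_sample, semantic_denoiser, geometry_denoiser, decode_geometry) and latent shapes are illustrative assumptions rather than the released SceneFactor code, and the denoisers are random stubs standing in for trained networks.

# Hypothetical sketch of SceneFactor-style factored sampling (not the official API).
# Stage 1: text -> coarse semantic layout latent; Stage 2: layout -> geometric latent.
import numpy as np

rng = np.random.default_rng(0)

def ddpm_sample(denoiser, shape, cond, steps=50):
    """Minimal DDPM-style ancestral sampling loop with a linear beta schedule."""
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # start from Gaussian noise
    for t in reversed(range(steps)):
        eps_hat = denoiser(x, t, cond)                  # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])   # posterior mean
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Stubs standing in for trained denoising networks (assumptions, not real models).
def semantic_denoiser(x, t, text_embedding):
    return 0.1 * x + 0.01 * text_embedding.mean()

def geometry_denoiser(x, t, semantic_latent):
    return 0.1 * x + 0.01 * semantic_latent.mean()

def decode_geometry(geo_latent):
    """Placeholder for a latent-to-geometry decoder; here it just returns the latent."""
    return geo_latent

# Stage 1: generate a coarse semantic layout latent from a text embedding.
text_embedding = rng.standard_normal(512)               # e.g. output of a frozen text encoder
semantic_latent = ddpm_sample(semantic_denoiser, shape=(8, 16, 16, 16), cond=text_embedding)

# Stage 2: generate scene geometry conditioned on the semantic layout.
geo_latent = ddpm_sample(geometry_denoiser, shape=(8, 32, 32, 32), cond=semantic_latent)
scene_geometry = decode_geometry(geo_latent)
print(scene_geometry.shape)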

Large-scale Scene Generation from Text Input

Qualitative comparisons to the state-of-the-art diffusion-based 3D scene generation approaches BlockFusion and SDFusion. Our approach produces improved scene geometry and more cohesive global scene structure with consistent walls compared to the baselines.

*Note that results for BlockFusion are generated unconditionally

Possible Editing Scenarios

Additional Scene Editing Results

Generated scenes and their corresponding semantic maps are shown in the first and third rows, and alternatives for each object synthesis-based edit are shown in the second and fourth rows.

Scene Chunk Generation from Text Input

Qualitative comparison with the state of the art on text-guided scene chunk generation using Qwen1.5 captions. In comparison with PVD, NFD, SDFusion, and BlockFusion, SceneFactor generates higher-fidelity, more coherent scene structures through our factored approach.

*Note that results for BlockFusion are generated unconditionally

Perceptual Study

Perceptual study of the quality of text-guided 3D indoor scene generation and editing. (a) Unary study on perceptual geometric quality and text consistency for generated chunks and scenes. (b) Unary study on editing quality and scene consistency for SceneFactor. (c) Binary study between SceneFactor and baselines on text consistency between captions and generated chunks. (d) Binary study between SceneFactor and baselines on perceptual geometric quality of generated chunks. (e) Unary study of SceneFactor for locality of edits.

*Note that results for BlockFusion are generated unconditionally

BibTeX


      @misc{bokhovkin2024scenefactor,
          title         = {SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation},
          author        = {Bokhovkin, Alexey and Meng, Quan and Tulsiani, Shubham and Dai, Angela},
          eprint        = {2412.01801},
          archivePrefix = {arXiv},
          year          = {2024}
      }