UniScene: Unified Occupancy-centric Driving Scene Generation

1. Shanghai Jiao Tong University, 2. Eastern Institute of Technology, 3. Tsinghua University, 4. MEGVII Technology, 5. Mach Drive, 6. Fudan University, 7. University of Hong Kong

Abstract

Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to provide the rich data forms required by diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms in driving scenes: semantic occupancy, video, and LiDAR. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data with two novel transfer strategies, Gaussian-based Joint Rendering and Prior-guided Sparse Modeling, respectively. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous state-of-the-art methods in occupancy, video, and LiDAR generation, and that the generated data in turn benefits downstream driving tasks.


Teaser


(a) Overview of UniScene. Given BEV layouts, UniScene facilitates versatile data generation, including semantic occupancy, multi-view video, and LiDAR point clouds, through an occupancy-centric hierarchical modeling approach. (b) Performance comparison on different generation tasks. UniScene delivers substantial improvements over SOTA methods in video, LiDAR, and occupancy generation.


Versatile generation ability of UniScene. (a) Large-scale coherent generation of semantic occupancy, LiDAR point clouds, and multi-view videos. (b) Controllable generation of geometry-edited occupancy, video, and LiDAR by simply editing the input BEV layouts to convey user commands. (c) Controllable generation of attribute-diverse videos by changing the input text prompts.
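Because all three modalities are driven by the same BEV layout, geometry edits amount to rasterizing a modified set of object boxes and rerunning generation. The snippet below is only a schematic of that idea; the rasterize_layout routine, the box format, and the class ids are hypothetical and not part of the released code.

import torch

def rasterize_layout(boxes, bev_shape=(200, 200), num_classes=8):
    """Rasterize object boxes into a one-hot BEV layout (hypothetical format).

    Each box is (class_id, x0, y0, x1, y1) in BEV grid coordinates; adding,
    removing, or moving boxes here is what "editing the BEV layout" means.
    """
    layout = torch.zeros(num_classes, *bev_shape)
    for cls, x0, y0, x1, y1 in boxes:
        layout[cls, y0:y1, x0:x1] = 1.0
    return layout

# Example edit: remove a parked object (class 2) and add a vehicle (class 1)
# ahead of the ego car, then feed the edited layout back through generation.
original_boxes = [(1, 90, 100, 110, 120), (2, 60, 40, 80, 70)]
edited_boxes = [b for b in original_boxes if b[0] != 2] + [(1, 95, 140, 115, 160)]
bev_layout = rasterize_layout(edited_boxes)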


Overview


Overall framework of the proposed method. The joint generation process is organized into an occupancy-centric hierarchy: I. Controllable Occupancy Generation. The BEV layouts are concatenated with noise volumes, fed into the Occupancy Diffusion Transformer, and the denoised latents are decoded by the Occupancy VAE Decoder. II. Occupancy-based Video and LiDAR Generation. For video, the occupancy is converted into 3D Gaussians and rendered into semantic and depth maps, which are processed by additional encoders as in ControlNet; the video output is decoded by the Video VAE Decoder. For LiDAR, the occupancy is processed by a sparse UNet, sampled under geometric prior guidance, and sent to the LiDAR head for generation.
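To make the hierarchy concrete, the following Python-style sketch traces the same flow end to end. Every module name and dictionary key below (occ_dit, occ_vae_decoder, occ_to_gaussians, sparse_unet, and so on) is a hypothetical placeholder standing in for a component from the figure, not the released UniScene interface.

import torch

def generate_scene(bev_layout, text_prompt, m, steps=50):
    """Occupancy-centric two-stage generation (illustrative sketch only).

    `m` is assumed to be a dict of callables standing in for the modules
    named in the figure; all keys are hypothetical placeholders.
    """
    # Stage I: controllable occupancy generation.
    # A noise volume is concatenated with the encoded BEV layout and
    # iteratively denoised by the Occupancy Diffusion Transformer.
    z = torch.randn(m['latent_shape'])
    layout_feat = m['layout_encoder'](bev_layout)
    for t in reversed(range(steps)):
        z = m['occ_dit'](torch.cat([z, layout_feat], dim=1), t)
    occupancy = m['occ_vae_decoder'](z)  # semantic occupancy volume

    # Stage II(a): occupancy-based video generation.
    # Occupancy -> 3D Gaussians -> per-view semantic and depth maps, which
    # condition the video model ControlNet-style before VAE decoding.
    gaussians = m['occ_to_gaussians'](occupancy)
    semantic_maps, depth_maps = m['gaussian_render'](gaussians, m['cameras'])
    video_latent = m['video_diffusion'](
        cond=m['control_encoder'](semantic_maps, depth_maps), prompt=text_prompt)
    video = m['video_vae_decoder'](video_latent)

    # Stage II(b): occupancy-based LiDAR generation.
    # Sparse-UNet features are sampled under geometric prior guidance and
    # decoded into points by the LiDAR head.
    sparse_feats = m['sparse_unet'](occupancy)
    ray_feats = m['prior_guided_sampling'](sparse_feats, m['lidar_rays'])
    lidar_points = m['lidar_head'](ray_feats)

    return occupancy, video, lidar_points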


Experimental Results

Qualitative Results

Quantitative Results

Quantitative evaluation for occupancy reconstruction on the NuScenes-Occupancy validation set. The compression ratio is calculated following the methodology outlined in OccWorld.
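The ratio itself is just the raw grid size divided by the latent size; the exact accounting follows OccWorld. As a rough illustration only (the grid and latent dimensions below are made-up examples, not UniScene's actual configuration):

# Rough illustration of an occupancy compression ratio; the sizes are
# made-up examples, not UniScene's actual configuration.
grid_voxels = 200 * 200 * 16      # raw semantic occupancy grid
latent_values = 50 * 50 * 4       # spatially downsampled latent with 4 channels
compression_ratio = grid_voxels / latent_values
print(compression_ratio)          # 64.0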
Quantitative evaluation for occupancy generation ('Ours-Gen.') and forecasting ('Ours-Fore.') on the NuScenes-Occupancy validation set. 'Ours-Gen.' and 'Ours-Fore.' denote our Generation model and Forecasting model, respectively. 'CFG' refers to Classifier-Free Guidance.
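CFG modulates how strongly sampling follows the BEV-layout condition. A minimal, generic denoising-step sketch of classifier-free guidance is shown below; the model signature and null-condition handling are our own simplification, not UniScene's actual interface.

def cfg_denoise_step(model, x_t, t, layout_cond, guidance_scale=3.0):
    """One classifier-free-guidance step (generic sketch, not UniScene's API).

    The denoiser is queried with and without the layout condition and the
    two predictions are blended; guidance_scale = 1.0 disables guidance.
    """
    eps_cond = model(x_t, t, cond=layout_cond)   # layout-conditioned prediction
    eps_uncond = model(x_t, t, cond=None)        # unconditional ("null") prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)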
Quantitative evaluation for video generation on the NuScenes validation set. Vista* denotes our multi-view variant of Vista, implemented with spatial-temporal attention.
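Here, spatial-temporal attention means attention factorized across the view/spatial tokens of each frame and across time. A minimal block in that spirit is sketched below; the tensor layout and module structure are our own simplification and do not reflect the Vista or UniScene code.

import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Factorized attention: first across view/spatial tokens, then across time."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, tokens, dim), where tokens span all camera views.
        b, t, n, d = x.shape
        xs = x.reshape(b * t, n, d)                        # attend within each frame
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        xt = xs.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]        # attend across frames
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)  # back to (b, t, n, d)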
Quantitative evaluation for LiDAR generation on the NuScenes validation set. We include the semantic occupancy generation time for a fair comparison.
Quantitative evaluation of support for the semantic occupancy prediction model (baseline: CONet) on the NuScenes-Occupancy validation set. 'C', 'L', and 'L^D' denote camera, LiDAR, and depth projected from LiDAR, respectively.

Citation

@article{li2024uniscene,
  title={UniScene: Unified Occupancy-centric Driving Scene Generation},
  author={Li, Bohan and Guo, Jiazhe and Liu, Hongsi and Zou, Yingshuang and Ding, Yikang and Chen, Xiwu and Zhu, Hu and Tan, Feiyang and Zhang, Chi and Wang, Tiancai and others},
  journal={arXiv preprint arXiv:2412.05435},
  year={2024}
}