DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation


Wei Wu1,2, Xi Guo2, Weixuan Tang2, Tingxuan Huang3, Chiyu Wang2, Dongyue Chen3, Chenjing Ding2† 📧

1Tsinghua University, 2Sensetime Research, 3Northeastern University

[Paper]     [Code]

Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation due to the challenges of integrating 3D information while maintaining spatial-temporal consistency and effectively learning from a unified model. We propose DriveScape, an end-to-end framework for multi-view, 3D condition-guided video generation, capable of producing 1024 × 576 high-resolution videos at 10 Hz. Unlike other methods, which are limited to 2 Hz by the frame rate of the 3D box annotations, DriveScape overcomes this limit through its ability to operate under sparse conditions. Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information, maintaining spatial-temporal consistency. DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39. We will make our code and pre-trained model publicly available.

High-Resolution Generation with Sparse Layout


Generating 10 Hz multi-view high-resolution (1024 × 576) videos (30 frames) from sparse conditions (2 Hz).

First row: generated panoramic video. Second row: sparse layout condition; black frames indicate that no layout condition is available at that moment.
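The relationship between the 10 Hz output and the 2 Hz layout condition can be illustrated with a short sketch. It only works out which of the 30 frames would carry a layout, assuming the annotations are evenly spaced and aligned with the first frame; the function name and this alignment are illustrative assumptions, not details taken from the DriveScape implementation.

```python
def sparse_condition_mask(num_frames=30, video_hz=10, condition_hz=2):
    """Return one boolean per frame: True where a layout condition exists.

    Assumption (not from the paper): annotations are evenly spaced and the
    first annotated frame coincides with frame 0.
    """
    if video_hz % condition_hz != 0:
        raise ValueError("video_hz must be an integer multiple of condition_hz")
    stride = video_hz // condition_hz  # 10 Hz / 2 Hz -> every 5th frame
    return [i % stride == 0 for i in range(num_frames)]


mask = sparse_condition_mask()
conditioned_frames = [i for i, has_layout in enumerate(mask) if has_layout]
unconditioned_count = mask.count(False)
```

Under these assumptions, 6 of the 30 frames have a layout and the remaining 24 correspond to the black frames in the second row, which the model has to generate without a direct layout condition.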



Method



Resolution and Feature Comparison


Resolution Comparison

Feature Comparison

Model               Spatial Res.   FPS     Sparse Condition
Drive-WM            192 × 384      2       ×
WoVoGen             256 × 448      2       ×
Delphi              512 × 512      2       ×
GenAD               256 × 448      2       ×
MagicDrive          272 × 736      -       ×
DriveDreamer 1&2    448 × 256      2       ×
DriveDiffusion      512 × 512      2       ×
Panacea             256 × 512      2       ×
Ours (DriveScape)   576 × 1024     2~10    ✓
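To put the spatial resolutions in the table on a common scale, the following sketch computes the number of pixels per frame for each model and the ratio relative to DriveScape. The values are copied from the table; the dictionary and function names are illustrative only.

```python
# Spatial resolutions as listed in the comparison table. The order of the two
# numbers does not matter here because only their product is used.
RESOLUTIONS = {
    "Drive-WM": (192, 384),
    "WoVoGen": (256, 448),
    "Delphi": (512, 512),
    "GenAD": (256, 448),
    "MagicDrive": (272, 736),
    "DriveDreamer 1&2": (448, 256),
    "DriveDiffusion": (512, 512),
    "Panacea": (256, 512),
    "DriveScape": (576, 1024),
}


def pixels_per_frame(a, b):
    """Number of pixels in a single frame of the given resolution."""
    return a * b


ours = pixels_per_frame(*RESOLUTIONS["DriveScape"])

# How many times more pixels per frame DriveScape produces than each model.
ratios = {
    name: ours / pixels_per_frame(*res)
    for name, res in RESOLUTIONS.items()
    if name != "DriveScape"
}
```

For example, a DriveScape frame contains 2.25 times as many pixels as a 512 × 512 frame and 8 times as many as a 192 × 384 frame.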


Layout Editing


Car-crash videos simulated on a private dataset by editing the layouts.

We simulate a scenario in which the following vehicle rear-ends the ego vehicle.

We simulate a rear-end collision scenario with the leading vehicle.

Left: edited videos; middle: structured input; right: original videos.

Control by Text


We can generate diverse videos by controlling them through text, even when the structured input is identical.

The video shows a dark street at night with a car driving down the road. The street is illuminated by streetlights. The blue vehicle is passing a junction in the image. The ego vehicle is moving, as it is seen driving down the street in the video.

The video shows a city street with a mix of vehicles, including cars and a truck. The weather appears to be cloudy, and the lighting is dim.

Left: generated videos; middle: structured input; right: original videos.