Physics-Informed Driving World Model

Abstract

Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks such as 3D object detection, segmentation, and trajectory prediction. While world models offer a cost-effective way to generate realistic driving videos, challenges remain in ensuring these videos adhere to fundamental physical principles: relative and absolute motion, spatial relationships such as occlusion, and spatial and temporal consistency. To address these challenges, we propose DrivePhysica, a model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in these physical principles, we achieve state-of-the-art driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and strong performance on downstream perception tasks.
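To make the three modules concrete, below is a minimal PyTorch sketch (ours, not the released implementation) of one way the corresponding guidance signals could be fused into a single conditioning tensor; all module names, the feature dimension, and the additive fusion are illustrative assumptions.

```python
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Fuse motion, flow, and box guidance into one conditioning tensor."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Coordinate System Aligner (assumed role): maps relative (ego-frame)
        # and absolute (world-frame) motion features into a shared space.
        self.rel_proj = nn.Linear(dim, dim)
        self.abs_proj = nn.Linear(dim, dim)
        # Instance Flow Guidance (assumed role): encodes per-instance 3D flow.
        self.flow_proj = nn.Linear(dim, dim)
        # Box Coordinate Guidance (assumed role): encodes 3D box coordinates.
        self.box_proj = nn.Linear(dim, dim)

    def forward(self, rel_motion, abs_motion, flow, boxes):
        # Additive fusion is an assumption; the paper may use attention instead.
        return (self.rel_proj(rel_motion) + self.abs_proj(abs_motion)
                + self.flow_proj(flow) + self.box_proj(boxes))
```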

[Figure: DrivePhysica overview and method]

1. Scenario Simulation Using Carla-Generated Layouts

DrivePhysica can generate driving videos from layout conditions provided by the CARLA simulator, as sketched below.
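For reference, here is a hedged sketch of how such layout conditions (3D vehicle boxes) could be read from a running CARLA server via the official Python API; the output dictionary format is our own assumption, not DrivePhysica's actual condition schema.

```python
# Sketch: extract 3D vehicle boxes from CARLA as layout conditions.
# Requires the `carla` package and a CARLA server on localhost:2000.
import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

layout = []
for vehicle in world.get_actors().filter('vehicle.*'):
    transform = vehicle.get_transform()  # world-frame pose
    bbox = vehicle.bounding_box          # actor-frame box (center + half-extents)
    layout.append({
        'id': vehicle.id,
        'location': (transform.location.x, transform.location.y, transform.location.z),
        'yaw_deg': transform.rotation.yaw,
        'extent': (bbox.extent.x, bbox.extent.y, bbox.extent.z),  # half-sizes in meters
    })
```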

1.1 Corner Case in Autonomous Driving

DrivePhysica can generate rare but critical driving scenarios from CARLA-generated layout conditions.
These synthetic layouts address a key limitation of real-world driving video datasets:
the lack of scene diversity, especially for rare or challenging corner cases.

1.2 Long-term Generation (2x speed)

DrivePhysica can simulate long-duration driving scenarios from CARLA-generated layout conditions.
It generates plausible new content on the fly and maintains a consistent world for up to one minute,
a duration that significantly exceeds that of the videos in the nuScenes dataset. A sketch of one possible rollout scheme follows.
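This is a minimal sketch assuming sliding-window autoregression, in which each clip is conditioned on the tail of the previous one; `model.generate` and its arguments are hypothetical placeholders, not the actual DrivePhysica API.

```python
def rollout(model, layout_clips, context_frames=4):
    """Autoregressive long-horizon generation (hypothetical API)."""
    frames, context = [], None
    for layout in layout_clips:           # one layout clip per generation step
        clip = model.generate(layout=layout, context=context)  # placeholder call
        frames.extend(clip)
        context = clip[-context_frames:]  # carry the clip tail forward
    return frames
```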

2. Multimodal Condition Controllability

2.1 Text Prompt Editing

DrivePhysica can generate diverse weather scenarios from the same control conditions,
which makes it possible to simulate extreme weather for training perception models.
For video editing, we append descriptors such as "sunny," "rainy," or "night" to the text prompt, as sketched below.
The videos below show the control condition, followed by the "sunny," "rainy," and "night" scenarios.
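A minimal sketch of this prompt-editing step; the base prompt is an illustrative example, not one taken from our training data.

```python
def edit_weather_prompts(base_prompt, descriptors=("sunny", "rainy", "night")):
    """Append weather/lighting descriptors to a shared base prompt."""
    return [f"{base_prompt} {d}." for d in descriptors]

prompts = edit_weather_prompts("A realistic multi-view driving scene on a city road.")
# Each prompt is paired with the *same* layout conditions, so only the
# appearance (weather and lighting) changes across the generated videos.
```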

2.2 Layout Controllability

DrivePhysica responds precisely to control conditions such as box projections, map projections, and instance flow.
We overlay the 3D bounding-box projections onto the generated videos; a sketch of this projection step follows.
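A hedged sketch of how such an overlay can be computed with a standard pinhole camera model; the box parameterization and matrix conventions below are our assumptions, not necessarily those used in our pipeline.

```python
import numpy as np

def box_corners(center, half_extent, yaw):
    """Eight world-frame corners of a 3D box with heading `yaw` (radians)."""
    dx, dy, dz = half_extent
    corners = np.array([[sx * dx, sy * dy, sz * dz]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # rotation about z
    return corners @ rot.T + np.asarray(center)

def project(points_world, extrinsic, intrinsic):
    """World -> camera -> pixel coordinates (4x4 extrinsic, 3x3 intrinsic)."""
    homo = np.c_[points_world, np.ones(len(points_world))]
    cam = (homo @ extrinsic.T)[:, :3]   # world frame -> camera frame
    pix = cam @ intrinsic.T             # camera frame -> image plane
    return pix[:, :2] / pix[:, 2:3]     # perspective divide
```

The projected corners can then be connected with line segments and drawn over each generated frame.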

3. Qualitative Comparison

3.1 Comparison with Baseline

3.2 More Results

4. Stochastic Diversity of Generation

DrivePhysica can generate diverse videos from the same control conditions by varying the stochastic noise input, as sketched below.
The videos below show the control condition (first row) and two videos sampled with different noise inputs (second and third rows).
Both sampled videos adhere to the constraints defined in the first row.
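A minimal sketch of this sampling procedure, assuming a diffusion-style sampler whose initial latent noise is seeded; `model.latent_shape` and `model.denoise` are hypothetical placeholders.

```python
import torch

def sample_diverse(model, conditions, seeds=(0, 1)):
    """Sample videos under identical conditions, varying only the noise seed."""
    videos = []
    for seed in seeds:
        gen = torch.Generator().manual_seed(seed)
        noise = torch.randn(model.latent_shape, generator=gen)  # hypothetical attr
        videos.append(model.denoise(noise, conditions))         # hypothetical call
    return videos
```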