Physics-Informed Driving World Model

Abstract

Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks such as 3D object detection, segmentation, and trajectory prediction. While world models offer a cost-effective way to generate realistic driving videos, challenges remain in ensuring these videos adhere to fundamental physical principles: relative and absolute motion, spatial relationships such as occlusion, and spatial and temporal consistency. To address these challenges, we propose DrivePhysica, a model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in these physical principles, we achieve state-of-the-art driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and strong performance on downstream perception tasks.
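To make the three modules concrete, below is a minimal PyTorch sketch (ours, not the released implementation) of one way the corresponding guidance signals could be fused into a single conditioning tensor; all module names, the feature dimension, and the additive fusion are illustrative assumptions.

```python
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Fuse motion, flow, and box guidance into one conditioning tensor."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Coordinate System Aligner (assumed role): maps relative (ego-frame)
        # and absolute (world-frame) motion features into a shared space.
        self.rel_proj = nn.Linear(dim, dim)
        self.abs_proj = nn.Linear(dim, dim)
        # Instance Flow Guidance (assumed role): encodes per-instance 3D flow.
        self.flow_proj = nn.Linear(dim, dim)
        # Box Coordinate Guidance (assumed role): encodes 3D box coordinates.
        self.box_proj = nn.Linear(dim, dim)

    def forward(self, rel_motion, abs_motion, flow, boxes):
        # Additive fusion is an assumption; the paper may use attention instead.
        return (self.rel_proj(rel_motion) + self.abs_proj(abs_motion)
                + self.flow_proj(flow) + self.box_proj(boxes))
```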

[Figure: DrivePhysica overview and method]

1. Scenario Simulation Using Carla-Generated Layouts

DrivePhysica can generate driving videos from layout conditions provided by the CARLA simulator, as sketched below.
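For reference, here is a hedged sketch of how such layout conditions (3D vehicle boxes) could be read from a running CARLA server via the official Python API; the output dictionary format is our own assumption, not DrivePhysica's actual condition schema.

```python
# Sketch: extract 3D vehicle boxes from CARLA as layout conditions.
# Requires the `carla` package and a CARLA server on localhost:2000.
import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

layout = []
for vehicle in world.get_actors().filter('vehicle.*'):
    transform = vehicle.get_transform()  # world-frame pose
    bbox = vehicle.bounding_box          # actor-frame box (center + half-extents)
    layout.append({
        'id': vehicle.id,
        'location': (transform.location.x, transform.location.y, transform.location.z),
        'yaw_deg': transform.rotation.yaw,
        'extent': (bbox.extent.x, bbox.extent.y, bbox.extent.z),  # half-sizes in meters
    })
```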

1.1 Corner Case in Autonomous Driving

DrivePhysica can generate rare but critical driving scenarios from CARLA-generated layout conditions.
These synthetic layouts address a key limitation of real-world driving video datasets:
the lack of scene diversity, especially for rare or challenging corner cases.

1.2 Long-term Generation (2x speed)

DrivePhysica can simulate long-duration driving scenarios from CARLA-generated layout conditions.
It generates plausible new content on the fly and maintains a consistent world for up to one minute,
a duration that significantly exceeds that of the videos in the nuScenes dataset. A sketch of one possible rollout scheme follows.
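This is a minimal sketch assuming sliding-window autoregression, in which each clip is conditioned on the tail of the previous one; `model.generate` and its arguments are hypothetical placeholders, not the actual DrivePhysica API.

```python
def rollout(model, layout_clips, context_frames=4):
    """Autoregressive long-horizon generation (hypothetical API)."""
    frames, context = [], None
    for layout in layout_clips:           # one layout clip per generation step
        clip = model.generate(layout=layout, context=context)  # placeholder call
        frames.extend(clip)
        context = clip[-context_frames:]  # carry the clip tail forward
    return frames
```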

2. Multimodal Condition Controllability

2.1 Text Prompt Editing

DrivePhysica can generate diverse weather scenarios from the same control conditions,
which makes it possible to simulate extreme weather for training perception models.
For video editing, we append descriptors such as "sunny," "rainy," or "night" to the text prompt, as sketched below.
The videos below show the control condition, followed by the "sunny," "rainy," and "night" scenarios.
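A minimal sketch of this prompt-editing step; the base prompt is an illustrative example, not one taken from our training data.

```python
def edit_weather_prompts(base_prompt, descriptors=("sunny", "rainy", "night")):
    """Append weather/lighting descriptors to a shared base prompt."""
    return [f"{base_prompt} {d}." for d in descriptors]

prompts = edit_weather_prompts("A realistic multi-view driving scene on a city road.")
# Each prompt is paired with the *same* layout conditions, so only the
# appearance (weather and lighting) changes across the generated videos.
```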

2.2 Layout Controllability

DrivePhysica responds precisely to control conditions such as box projections, map projections, and instance flow.
We overlay the 3D bounding-box projections onto the generated videos; a sketch of this projection step follows.
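A hedged sketch of how such an overlay can be computed with a standard pinhole camera model; the box parameterization and matrix conventions below are our assumptions, not necessarily those used in our pipeline.

```python
import numpy as np

def box_corners(center, half_extent, yaw):
    """Eight world-frame corners of a 3D box with heading `yaw` (radians)."""
    dx, dy, dz = half_extent
    corners = np.array([[sx * dx, sy * dy, sz * dz]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # rotation about z
    return corners @ rot.T + np.asarray(center)

def project(points_world, extrinsic, intrinsic):
    """World -> camera -> pixel coordinates (4x4 extrinsic, 3x3 intrinsic)."""
    homo = np.c_[points_world, np.ones(len(points_world))]
    cam = (homo @ extrinsic.T)[:, :3]   # world frame -> camera frame
    pix = cam @ intrinsic.T             # camera frame -> image plane
    return pix[:, :2] / pix[:, 2:3]     # perspective divide
```

The projected corners can then be connected with line segments and drawn over each generated frame.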

3. Qualitative Comparison

3.1 Comparison with Baseline

3.2 More Results

4. Stochastic Diversity of Generation

DrivePhysica can generate diverse videos from the same control conditions by varying the stochastic noise input, as sketched below.
The videos below show the control condition (first row) and two videos sampled with different noise inputs (second and third rows).
Both sampled videos adhere to the constraints defined in the first row.
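A minimal sketch of this sampling procedure, assuming a diffusion-style sampler whose initial latent noise is seeded; `model.latent_shape` and `model.denoise` are hypothetical placeholders.

```python
import torch

def sample_diverse(model, conditions, seeds=(0, 1)):
    """Sample videos under identical conditions, varying only the noise seed."""
    videos = []
    for seed in seeds:
        gen = torch.Generator().manual_seed(seed)
        noise = torch.randn(model.latent_shape, generator=gen)  # hypothetical attr
        videos.append(model.denoise(noise, conditions))         # hypothetical call
    return videos
```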