World Models and World Action Models (WAM)

Abstract

World models—internal predictive representations that enable agents to simulate future states, anticipate consequences, and plan actions—have emerged as a foundational paradigm in embodied artificial intelligence. Originating from model-based reinforcement learning, this field has undergone a radical transformation with the advent of large-scale generative models, blurring the historical boundary between passive video prediction and interactive physical simulation. Concurrently, Vision-Language-Action (VLA) models have established a powerful framework for grounding high-level linguistic intent in low-level motor control. The natural convergence of these two threads—predictive world simulation and action-grounded multimodal reasoning—has given rise to Embodied World Action Models (WAMs), representing a new frontier in which agents learn to act by imagining their futures. However, the explosive growth of methods across robotics, autonomous driving, and interactive simulation has produced a fragmented landscape that lacks systematic unification.

This survey presents a comprehensive and structured review of the modern world model ecosystem, covering more than 200 key papers organized into a unified taxonomy. We systematically cover six major pillars: (i) Foundation World Models, including general-purpose interactive simulators (Genie, Cosmos, Sora) and game-specific environments (Oasis, Matrix-Game); (ii) Vision-Language-Action Models, spanning foundational architectures (RT-2, π₀, OpenVLA), driving-specific VLAs, and embodied manipulation policies; (iii) Embodied World Action Models, unifying video generation and action prediction through zero-shot policies, controllable simulation platforms, and world-model-based reinforcement learning; (iv) Autonomous Driving World Models, addressing video generation, closed-loop simulation, planning policies, and geometric occupancy/BEV representations; (v) Efficiency and Evaluation, covering computational acceleration techniques and benchmarking protocols for physical plausibility; and (vi) Datasets and Ecosystems, including large-scale robot learning corpora and industry technical reports that underpin the entire field.

Through this organization, we illuminate the evolutionary trajectory from passive pixel predictors to active, reasoning-capable, action-grounded simulators. We identify critical open challenges, including physical consistency, cross-embodiment generalization, safety verification, and the sim-to-real evaluation gap, and outline future directions toward cognitive world models, autonomous data collection, and standardized open ecosystems. This survey aims to serve as a definitive reference for researchers and practitioners advancing the next generation of embodied intelligence.

Paper Overview

Eight sections covering the full spectrum from foundation simulation to embodied action.

Section 1

Introduction

Evolution from recurrent world models to large-scale generative simulators, and the convergence of world models with VLA models into WAMs.

Section 2

Foundation World Models

General-purpose interactive simulators (Genie, Cosmos, Sora) and game-specific environments (Oasis, Matrix-Game, GameGen-X).

Section 3

Vision-Language-Action Models

Foundational architectures (RT-2, π₀, OpenVLA), driving-specific VLAs, and embodied manipulation policies (Diffusion-VLA, 3D-VLA, RDT).

Section 4

Embodied World Action Models

Unified video-action pretraining (DreamZero, Unified World Models), controllable simulation (RoboScape), and policy optimization (Cosmos-Policy, ThinkAct).

Section 5

Autonomous Driving World Models

Driving video generation (DriveDreamer, GAIA-1), closed-loop simulation (DriveArena, Epona), planning (GenAD, DOE-1), and occupancy/BEV (OccWorld, HERMES).

Section 6

Efficiency and Evaluation

Computational acceleration (FAST, VLA-Cache, MoLe-VLA, TinyVLA) and benchmarks (WorldScore, WorldEval, AutoEval).

Section 7

Datasets and Ecosystems

Large-scale robot learning corpora (DROID, BridgeData, LIBERO, BEHAVIOR-1K) and industry technical reports (LingBot, GR-3, AgiBot).

Section 8

Conclusion & Future Directions

Open challenges in physical consistency, cross-embodiment generalization, safety verification, and sim-to-real evaluation, with future directions toward cognitive world models, autonomous data collection, and standardized open ecosystems.