SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

TL;DR

SCOPE extends a pre-trained video diffusion transformer into an interactive FPS world simulator through end-to-end joint training on CrossFPS, an action-annotated multi-game dataset spanning seven titles and captured without human bias. It incorporates a spatial action decoupling module that lets each spatial position determine its own visual response from local context. By mapping visual features to action responses rather than memorizing game-specific assets, SCOPE achieves zero-shot generalization to unseen first-person environments and remains responsive under high-frequency, high-intensity control signals.

Method

SCOPE inserts a lightweight spatial action decoupling module into each DiT block of a pre-trained video diffusion transformer. The module reshapes visual tokens into per-pixel temporal sequences and routes actions through two pathways: discrete events (fire, reload, etc.) use cross-attention with visual queries to confine effects to in-scope regions, while continuous controls (camera, movement) use MLP fusion and temporal self-attention for smooth out-of-scope ego-motion. All output projections are zero-initialized, so training begins from the unmodified video generator.

Figure 1. SCOPE architecture. A SCOPE module is inserted into each DiT block. Discrete inputs use cross-attention with visual queries to confine effects to in-scope regions. Continuous inputs use MLP fusion and temporal self-attention for out-of-scope generation. Pathways combine via residual connections.
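
As a rough illustration of two ideas in the module description, the sketch below regroups a flat token list into per-position temporal sequences and shows why a zero-initialized output projection leaves the visual tokens unchanged at the start of training. The (t, h, w) token ordering, the helper names, and the toy shapes are assumptions made for illustration, not details taken from SCOPE; the attention and MLP fusion steps are omitted.

```python
from typing import List

Vector = List[float]


def to_per_pixel_sequences(tokens: List[Vector], T: int, H: int, W: int) -> List[List[Vector]]:
    """Regroup a flat token list into H*W sequences of length T.

    Assumes tokens are flattened in (t, h, w) order (an assumption, not from the source).
    """
    assert len(tokens) == T * H * W
    per_frame = H * W
    return [[tokens[t * per_frame + pos] for t in range(T)] for pos in range(per_frame)]


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Compute y = W x, with the weight given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def residual_action_update(token: Vector, action_feature: Vector, out_proj: List[Vector]) -> Vector:
    """Add the projected action feature to the visual token (residual connection)."""
    delta = linear(action_feature, out_proj)
    return [v + d for v, d in zip(token, delta)]


# Toy shapes: 3 frames, a 2x2 spatial grid, 4 channels per token.
T, H, W, C = 3, 2, 2, 4
tokens = [[float(i)] * C for i in range(T * H * W)]
seqs = to_per_pixel_sequences(tokens, T, H, W)

# With a zero-initialized projection the residual branch contributes nothing.
zero_proj = [[0.0] * C for _ in range(C)]
updated = residual_action_update(tokens[0], [1.0, 2.0, 3.0, 4.0], zero_proj)
```

With the zero projection, `updated` equals the input token, which is the sense in which training starts from the unmodified generator.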

The model is trained end-to-end on CrossFPS—a multi-game action-annotated dataset spanning seven FPS titles (69K clips) with frame-aligned 10-DoF gamepad telemetry captured without human bias. Stochastic action dropout during training enables Action Classifier-Free Guidance (Action-CFG) at inference for tunable action intensity.

Figure 2. CrossFPS overview. Clip distribution across seven FPS titles (69K total) with frame-aligned 10-DoF gamepad telemetry.
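
The interplay between action dropout and Action-CFG can be sketched as follows. This is a minimal sketch that assumes the standard classifier-free guidance extrapolation; the source does not state the dropout probability, the null-action representation, or the exact guidance formula, so `maybe_drop_action`, `action_cfg`, and `scale` are hypothetical names.

```python
import random
from typing import List, Optional


def maybe_drop_action(action: List[float], drop_prob: float, rng: random.Random) -> Optional[List[float]]:
    """Training side: replace the action with a null condition (None) with probability drop_prob."""
    return None if rng.random() < drop_prob else action


def action_cfg(pred_uncond: List[float], pred_cond: List[float], scale: float) -> List[float]:
    """Inference side: standard CFG extrapolation, uncond + scale * (cond - uncond).

    Assumed form; scale = 1 recovers the conditional prediction and larger
    values push further along the action-conditioned direction.
    """
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


rng = random.Random(0)
always_dropped = maybe_drop_action([0.5, -0.5], drop_prob=1.0, rng=rng)
never_dropped = maybe_drop_action([0.5, -0.5], drop_prob=0.0, rng=rng)
guided = action_cfg([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

Because the model sees both conditioned and null-action samples during training, the two predictions needed by `action_cfg` can come from the same network at inference.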

1. Action Space

SCOPE uses a Gamepad 10-DoF action interface with four functional groups: two continuous axes for movement, two continuous axes for camera control, three discrete combat buttons, and three discrete utility buttons. The telemetry format below matches the action supervision and overlay visualization used throughout the demos.

Action Telemetry Format · 10 Dimensions in 4 Groups

Movement: LX / LY

LX: Move left / right

LY: Move forward / back

Camera: RX / RY

RX: Turn left / right

RY: Look up / down

Combat: LT / RT / R3

RT: Fire

LT: Aim down sights (ADS)

R3: Melee

Utility: A / X / Y

A: Jump

X: Reload

Y: Switch weapon
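
One way to hold a single frame of this telemetry in code is sketched below. The field order, the use of booleans for buttons, and the `GamepadAction` name are assumptions for illustration; the source specifies only the ten dimensions and their four groups, not a value range or serialization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GamepadAction:
    # Movement (continuous axes)
    lx: float = 0.0  # move left / right
    ly: float = 0.0  # move forward / back
    # Camera (continuous axes)
    rx: float = 0.0  # turn left / right
    ry: float = 0.0  # look up / down
    # Combat (discrete buttons)
    lt: bool = False  # aim down sights
    rt: bool = False  # fire
    r3: bool = False  # melee
    # Utility (discrete buttons)
    a: bool = False  # jump
    x: bool = False  # reload
    y: bool = False  # switch weapon

    def continuous(self) -> List[float]:
        """Return the four continuous axes in (LX, LY, RX, RY) order."""
        return [self.lx, self.ly, self.rx, self.ry]

    def discrete(self) -> List[int]:
        """Return the six buttons as 0/1 in (LT, RT, R3, A, X, Y) order."""
        return [int(b) for b in (self.lt, self.rt, self.r3, self.a, self.x, self.y)]

    def to_vector(self) -> List[float]:
        """Concatenate both groups into one 10-dimensional vector."""
        return self.continuous() + [float(b) for b in self.discrete()]


# Moving forward while firing.
action = GamepadAction(ly=1.0, rt=True)
vector = action.to_vector()
```

Keeping `continuous()` and `discrete()` separate mirrors the split between the two action pathways described in the Method section.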

2. Action Controllability in Unseen Scenes

A central claim of SCOPE is that mapping visual features to action responses—rather than memorizing game-specific assets—enables effective transfer to unseen scenes. We evaluate on first-person frames synthesized by GPT-image-2 spanning game aesthetics entirely absent from training, focusing on multi-action compositions that test the spatial action decoupling module under concurrent gamepad inputs.

Multi-Action Compositions

Combinations of simultaneous or sequential controls: LT+RT, RT+X, LX/LY+RT. These test the spatial action decoupling module's ability to handle overlapping gamepad control channels. Note: The action sequences shown below display only the discrete button events for clarity; continuous movement signals (LX/LY) that run in parallel throughout the clips are omitted from the visualization.
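
To make the visualization note concrete, here is a minimal sketch of how per-frame telemetry could be reduced to discrete button events while ignoring the continuous axes. The per-frame dictionary layout and the `discrete_events` helper are hypothetical; the source does not describe how its overlays are produced.

```python
from typing import Dict, List, Tuple

DISCRETE_BUTTONS = ("LT", "RT", "R3", "A", "X", "Y")


def discrete_events(frames: List[Dict[str, float]]) -> List[Tuple[str, int, int]]:
    """Return (button, first_frame, last_frame) for each contiguous press.

    Continuous axes such as LX/LY are ignored, matching the simplified display.
    """
    events = []
    for button in DISCRETE_BUTTONS:
        start = None
        for i, frame in enumerate(frames):
            pressed = bool(frame.get(button, 0))
            if pressed and start is None:
                start = i  # press begins
            elif not pressed and start is not None:
                events.append((button, start, i - 1))  # press ended on the previous frame
                start = None
        if start is not None:
            events.append((button, start, len(frames) - 1))  # still held at clip end
    return sorted(events, key=lambda e: (e[1], e[0]))


# LY is held forward for the whole clip; RT fires on frames 2-4, then X reloads on frame 5.
frames = [{"LY": 1.0} for _ in range(6)]
for i in (2, 3, 4):
    frames[i]["RT"] = 1
frames[5]["X"] = 1
events = discrete_events(frames)
```

Here `events` contains one RT interval and one X press, and the forward movement on LY does not appear, just as the continuous signals are left out of the displayed sequences.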

3. Zero-Shot Visual Generalization

Despite training exclusively on FPS gameplay data from seven titles, SCOPE generalizes to drastically different visual styles—from cartoon to anime to photorealistic—without any fine-tuning. Each video below is generated from a single context frame in an unseen environment. Highlight clips (orange border) are 12-frame excerpts that isolate action-environment interactions—the most challenging evaluation axis—within each style.

It Takes Two — Cartoon Style

Genshin Impact — Anime Style

Sample 3 · Highlight is a 12-frame excerpt demonstrating Object Interaction—RT applied to objects produces localized deformation, particle effects, and physics responses, requiring geometric transformation beyond texture synthesis.

Black Myth: Wukong — Cinematic Realism

Sample 7 · Highlight is a 12-frame excerpt demonstrating Environment Destruction—sustained RT activation on environmental elements (trees, structures, terrain) produces progressive destruction effects, confirming that discrete events propagate to the correct spatial regions.

Custom Desert — Stylized Realism

Sample 11 · Highlight is a 12-frame excerpt demonstrating NPC Hit Response—RT directed at NPCs produces hit markers, blood effects, and stagger animations. The model infers target regions from visual context without explicit segmentation.

4. In-Distribution Results

On games seen during training, SCOPE maintains high action-response fidelity and visual coherence, validating that the unified model does not sacrifice in-distribution quality for cross-game generalization.

Modern Warfare

WorldCam

5. Comparison with Baselines

Side-by-side comparison between SCOPE and three baselines: Matrix-Game 3.0, LingBot-World (Act), and HY-World 1.5. All methods receive the same context frame and action sequence.

Case 1 — Call of Duty: Modern Warfare III

SCOPE (Ours)