TL;DR: We introduce StereoWorld, a stereo world model that explores a scene from a given pair of binocular images, generating view-consistent stereo videos with intrinsic geometric understanding.
Illustration of StereoWorld. Given a pair of stereo images and a conditional camera trajectory, StereoWorld first encodes conditional and noisy video latents from different viewpoints and timesteps using a unified camera–frame RoPE representation. It then denoises them with a DiT equipped with stereo attention, ultimately producing the final stereo video.
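To make the unified camera–frame RoPE idea concrete, here is a minimal, hypothetical sketch. All function names and the specific indexing scheme are assumptions for illustration, not the paper's implementation: the key point it demonstrates is that if left-eye and right-eye tokens at the same frame are assigned the same temporal rotary position, they receive identical rotations, so cross-view (stereo) attention treats them as co-located in time.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    # Standard RoPE: one rotation angle per position for each of dim/2 frequency bands.
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, freqs)  # shape (len(pos), dim // 2)

def apply_rope(x, pos):
    # Rotate consecutive channel pairs of x by position-dependent angles.
    ang = rope_angles(pos, x.shape[-1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical unified camera-frame indexing: both views share the frame index
# as their temporal position, so tokens from the same frame rotate identically.
frames = np.arange(4)
pos_left = frames           # left-eye tokens, frames 0..3
pos_right = frames          # right-eye tokens reuse the same frame indices
x = np.random.default_rng(0).standard_normal((4, 8))
assert np.allclose(apply_rope(x, pos_left), apply_rope(x, pos_right))
```

Under this assumed scheme, view identity would be carried by a separate axis (e.g. a camera/pose embedding or a spatial RoPE dimension) rather than by offsetting the temporal positions, which is one plausible way to preserve the pretrained video model's temporal priors across both views.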
Each video shows stereo video (left) side-by-side with the corresponding depth estimation (right).
Our proposed Unified Camera-Frame RoPE is not only effective for stereo generation but also extends directly to multi-view video generation. After embedding our method into the Wan model, we obtain the following result without any additional training, which further demonstrates both the effectiveness of our method and its strong ability to leverage video-model priors.
@article{sun2026stereo,
title={Stereo World Model: Camera-Guided Stereo Video Generation},
author={Sun, Yang-Tian and Huang, Zehuan and Niu, Yifan and Ma, Lin and Cao, Yan-Pei and Ma, Yuewen and Qi, Xiaojuan},
journal={arXiv preprint arXiv:2603.17375},
year={2026}
}
The website template is borrowed from Nerfies.