Splatter a Video: Video Gaussian Representation for Versatile Processing

NeurIPS 2024
1 The University of Hong Kong, 2 VAST
*Equal Contribution, #Corresponding Author

TL;DR: We propose a method to represent in-the-wild videos using 3D Gaussians, without explicit camera dependency, for versatile video processing tasks.

Abstract

Video representation is a long-standing problem that is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation, the video Gaussian representation, that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with a 3D motion to capture the video's dynamics. This approach offers a more intrinsic and explicit representation than layered atlases or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It proves effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.
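The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the polynomial motion model, the fixed orthographic projection, and the names `gaussian_positions`, `project_orthographic`, and `flow_loss` are all assumptions made for illustration. It shows Gaussians stored in a canonical 3D space, a per-Gaussian motion that moves them over time, and a loss that compares their projected motion with a 2D optical-flow prior.

```python
import numpy as np


def gaussian_positions(canonical_xyz, motion_coeffs, t):
    """Return Gaussian centers at normalized time t in [0, 1].

    canonical_xyz: (N, 3) centers in the canonical space.
    motion_coeffs: (N, K, 3) per-Gaussian polynomial coefficients
                   (hypothetical parameterization, assumed here).
    """
    num_terms = motion_coeffs.shape[1]
    powers = float(t) ** np.arange(1, num_terms + 1)  # t, t^2, ..., t^K
    offsets = np.einsum("k,nkc->nc", powers, motion_coeffs)
    return canonical_xyz + offsets


def project_orthographic(xyz):
    """Split centers into 2D image coordinates and depth (assumed fixed camera)."""
    return xyz[:, :2], xyz[:, 2]


def flow_loss(canonical_xyz, motion_coeffs, t0, t1, target_flow):
    """Mean L1 gap between projected Gaussian motion and a 2D flow prior."""
    uv0, _ = project_orthographic(gaussian_positions(canonical_xyz, motion_coeffs, t0))
    uv1, _ = project_orthographic(gaussian_positions(canonical_xyz, motion_coeffs, t1))
    return float(np.mean(np.abs((uv1 - uv0) - target_flow)))


# Two Gaussians with purely linear motion (K = 1).
canonical = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 2.0]])
coeffs = np.array([[[1.0, 0.0, 0.0]], [[0.0, 2.0, 0.0]]])
track_uv, depth = project_orthographic(gaussian_positions(canonical, coeffs, 0.5))
```

Under these assumptions, following one Gaussian's projected coordinates across frames yields a 2D track, and the third coordinate yields a per-frame depth value. This is one way to see why tracking and depth can be read from the same representation.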

Video Gaussian Representation

Dense Video Tracking

Consistent feature

Geometry editing

Appearance editing

Novel view synthesis

Video frame interpolation

Quantitative Comparison on TAP-Vid DAVIS

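| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | AJ ↑ | <δ^x_avg ↑ | OA ↑ | TC ↓ | Training Time | GPU Memory | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4DGS | 18.12 | 0.5735 | 0.5130 | 5.1 | 10.2 | 75.45 | 8.11 | 40 mins | 10 GB | 145.8 |
| RoDynRF | 24.79 | 0.723 | 0.394 | - | - | - | - | > 24 hours | 24 GB | < 0.01 |
| Deformable Sprites | 22.83 | 0.6983 | 0.3014 | 20.6 | 32.9 | 69.7 | 2.07 | 30 mins | 24 GB | 1.6 |
| Omnimotion | 24.11 | 0.7145 | 0.3713 | 51.7 | 67.5 | 85.3 | 0.74 | > 24 hours | 24 GB | < 0.01 |
| CoDeF | 26.17 | 0.8160 | 0.2905 | 7.6 | 13.7 | 78.0 | 7.56 | ~30 mins | 10 GB | 8.8 |
| Ours | 28.63 | 0.8373 | 0.2283 | 41.9 | 57.7 | 79.2 | 1.82 | ~30 mins | 10 GB | 149 |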
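For readers unfamiliar with the reconstruction metrics, the PSNR column follows the standard peak signal-to-noise ratio, PSNR = 10 · log10(MAX² / MSE). The sketch below shows that standard definition only. The function name `psnr` and the assumption that pixel values lie in [0, 1] are illustrative, and the page does not state how scores are averaged over frames or videos.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01.
example_score = psnr(np.zeros((4, 4, 3)), np.full((4, 4, 3), 0.1))
```

Higher is better, and because the scale is logarithmic, a gain of a few dB corresponds to a substantial drop in mean squared error.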

Comparison with other methods

BibTeX

@article{sun2024sav,
        title={Splatter a Video: Video Gaussian Representation for Versatile Processing},
        author={Sun, Yang-Tian and Huang, Yi-Hua and Ma, Lin and Lyu, Xiaoyang and Cao, Yan-Pei and Qi, Xiaojuan},
        journal={arXiv preprint arXiv:2312.14937},
        year={2023}
      }

Acknowledgements

The website template is borrowed from Nerfies.