Splatter a Video: Video Gaussian Representation for Versatile Processing

NeurIPS 2024

¹The University of Hong Kong, ²VAST

^*Equal Contribution, ^#Corresponding Author

Abstract

Video representation is a long-standing problem that is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation, the video Gaussian representation, that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions to model video motion. This approach offers a more intrinsic and explicit representation than layered atlases or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It proves effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.
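As a rough illustration of the idea above, the sketch below stores one Gaussian with a canonical 3D position and a per-Gaussian motion, then reads a 2D track from it. The class name `VideoGaussian`, the polynomial motion parameterization, and the orthographic projection that simply drops depth are assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VideoGaussian:
    """One Gaussian in canonical space with an attached 3D motion (illustrative only)."""

    mean: np.ndarray           # (3,) canonical 3D position
    color: np.ndarray          # (3,) RGB appearance
    motion_coeffs: np.ndarray  # (K, 3) hypothetical polynomial motion coefficients

    def position_at(self, t: float) -> np.ndarray:
        """Return the 3D position at normalized time t: mean + sum_k c_k * t^(k+1)."""
        num_terms = len(self.motion_coeffs)
        powers = np.array([t ** (k + 1) for k in range(num_terms)])
        return self.mean + powers @ self.motion_coeffs


def track_2d(gaussian: VideoGaussian, times: list[float]) -> np.ndarray:
    """Project the moving Gaussian center to 2D by dropping depth (assumed orthographic)."""
    return np.stack([gaussian.position_at(t)[:2] for t in times])


# Example: a point that drifts linearly in x and quadratically in y.
g = VideoGaussian(
    mean=np.array([0.0, 0.0, 1.0]),
    color=np.array([1.0, 0.5, 0.2]),
    motion_coeffs=np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]),
)
track = track_2d(g, [0.0, 0.5, 1.0])
```

Because every Gaussian carries its own trajectory, a dense track is a lookup rather than a separate estimation step, which is one reason an explicit representation is convenient for the editing and tracking tasks listed above.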

Quantitative Comparison on TAP-Vid DAVIS

Methods	PSNR↑	SSIM↑	LPIPS↓	AJ ↑	<δ^x_avg ↑	OA ↑	TC ↓	Training Time	GPU Memory	FPS
4DGS	18.12	0.5735	0.5130	5.1	10.2	75.45	8.11	40 mins	10G	145.8
RoDynRF	24.79	0.723	0.394	\	\	\	\	> 24 hours	24G	<0.01
Deformable Sprites	22.83	0.6983	0.3014	20.6	32.9	69.7	2.07	30 mins	24G	1.6
Omnimotion	24.11	0.7145	0.3713	51.7	67.5	85.3	0.74	> 24 hours	24G	<0.01
CoDeF	26.17	0.8160	0.2905	7.6	13.7	78.0	7.56	~30 mins	10G	8.8
Ours	28.63	0.8373	0.2283	41.9	57.7	79.2	1.82	~30 mins	10G	149
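For readers unfamiliar with the reconstruction metric in the first column, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below assumes images stored as floating-point arrays in [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative, and the exact evaluation protocol behind the table is not stated here.

```python
import numpy as np


def psnr(reference: np.ndarray, rendered: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = reference.astype(np.float64) - rendered.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Example: a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
value = psnr(reference, rendered)
```

Higher is better, as the ↑ in the table header indicates; a uniform error of 0.1 on a unit-range image corresponds to 20 dB.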

BibTeX

@article{sun2024sav,
  title={Splatter a Video: Video Gaussian Representation for Versatile Processing},
  author={Sun, Yang-Tian and Huang, Yi-Hua and Ma, Lin and Lyu, Xiaoyang and Cao, Yan-Pei and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2312.14937},
  year={2023}
}
