MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

ICLR 2025

1ShanghaiTech, 2HKUST, 3HKU
*Equal contribution. Corresponding author.

Abstract

In this paper, we present MVTokenFlow for high-quality 4D content creation from monocular videos. Recent advancements in generative models such as video diffusion models and multiview diffusion models enable us to create videos or 3D models. However, extending these generative models to dynamic 4D content creation is still a challenging task that requires the generated content to be consistent both spatially and temporally. To address this challenge, MVTokenFlow utilizes a multiview diffusion model to generate multiview images at different timesteps, which attains spatial consistency across different viewpoints and allows us to reconstruct a reasonable coarse 4D field. Then, MVTokenFlow regenerates all the multiview images using the rendered 2D flows as guidance. The 2D flows effectively associate pixels across different timesteps and improve temporal consistency by reusing tokens in the regeneration process. Finally, the regenerated images are spatiotemporally consistent and are utilized to refine the coarse 4D field into a high-quality 4D field. Experiments demonstrate the effectiveness of our design and show significantly improved quality over baseline methods.

[Teaser figure]

Given an input monocular video containing a foreground dynamic object (left), MVTokenFlow generates a 4D video represented by a dynamic 3D Gaussian field (right) by utilizing a multiview diffusion model and a token propagation method to improve both the spatial and temporal consistency. On the right, we also show the colors of the Gaussian spheres and the rendered normal maps alongside the rendered RGB images.

Method

[Pipeline figure]

Overview. Given an input video, which can itself be generated by a video diffusion model, we first apply Era3D to generate multiview-consistent images and normal maps for each timestep. Then, we reconstruct a coarse dynamic 3D Gaussian field from the generated multiview images. After that, we use the coarse dynamic 3D Gaussian field to render 2D flows that guide the regeneration of the multiview images by Era3D, which greatly improves temporal consistency and image quality. Finally, the regenerated images are used to refine our dynamic 3D Gaussian field and further improve its quality.
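The flow-guided token reuse step can be illustrated with a minimal sketch: tokens from the previous timestep are warped forward along the rendered 2D flow and blended with the current tokens, so that corresponding pixels across timesteps share features. This is an illustrative numpy approximation under our own assumptions (nearest-neighbor backward warping, a simple blending weight `alpha`); the actual method operates on diffusion-model attention tokens, and the function names here are hypothetical.

```python
import numpy as np

def warp_tokens_with_flow(prev_tokens, flow):
    """Propagate feature tokens from timestep t-1 to timestep t by
    following a 2D flow field (nearest-neighbor backward warping).

    prev_tokens: (H, W, C) token/feature map at timestep t-1
    flow:        (H, W, 2) flow mapping each current pixel back to
                 its corresponding location at t-1 (dx, dy)
    """
    H, W, _ = prev_tokens.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Round to the nearest source pixel and clamp to the image bounds.
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return prev_tokens[src_y, src_x]

def blend_tokens(warped, current, alpha=0.7):
    # Reuse warped tokens from the previous timestep; alpha controls
    # how strongly temporal reuse overrides the freshly generated tokens.
    return alpha * warped + (1.0 - alpha) * current
```

With zero flow, the warp is an identity, and a uniform horizontal flow of one pixel shifts the token map by one column, which is the behavior the regeneration stage relies on to keep corresponding pixels consistent over time.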

Results

[Results figure]

Qualitative comparison of the spatial consistency of our method against baseline methods.

Flow map


Video-to-4D Results

[Video grids: View 1 | Optical Flow 1 | View 2 | Optical Flow 2]
BibTeX