(Project page demo: side-by-side comparisons of Ground Truth (GT) and Reconstructed videos.)
Yazhou Xing*, Yang Fei*, Yingqing He*†, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen† (*equal contribution, †corresponding author) The Hong Kong University of Science and Technology
A state-of-the-art Video Variational Autoencoder (VAE) designed for high-fidelity video reconstruction. This project leverages cross-modal and joint video-image training to enhance reconstruction quality.
- High-Fidelity Reconstruction: Achieve superior image and video reconstruction quality.
- Cross-Modal Reconstruction: Utilize captions to guide the reconstruction process.
- State-of-the-Art Performance: Set new benchmarks in video reconstruction tasks.
- [Jan 2025] 🏋️ Released training code and an improved pretrained 4z-text weight
- [Dec 2024] 🚀 Released inference code and pretrained models
- [Dec 2024] 📝 Released paper on arXiv
- [Dec 2024] 💡 Project page is live at VideoVAE+
- Release Pretrained Model Weights
- Release Inference Code
- Release Training Code
Follow these steps to set up your environment and run the code:
```bash
git clone https://github.com/VideoVerses/VideoVAEPlus.git
cd VideoVAEPlus
```
Create a Conda environment and install dependencies:
```bash
conda create --name vae python=3.10 -y
conda activate vae
pip install -r requirements.txt
```
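After installing the dependencies, a quick sanity check can confirm that PyTorch sees a CUDA GPU before you run inference or training. This is a minimal sketch; it only assumes `torch` is installed via `requirements.txt`:

```python
# check_env.py -- minimal environment sanity check (assumes torch is installed)
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Reconstructing long, high-resolution videos benefits from a GPU with ample memory.
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```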
| Model Name | Latent Channels | Download Link |
|---|---|---|
| sota-4z | 4 | Download |
| sota-4z-text | 4 | Download |
| sota-16z | 16 | Download |
| sota-16z-text | 16 | Download |
- Note: '4z' and '16z' indicate the number of latent channels in the VAE model. Models with 'text' support text guidance.
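If you want to verify a downloaded checkpoint before pointing a config at it, a minimal sketch like the one below loads it on CPU and prints its top-level keys. It assumes the weights are standard PyTorch checkpoint files; the filename and the exact key layout inside are assumptions, not guarantees:

```python
# inspect_ckpt.py -- peek inside a downloaded checkpoint (filename below is hypothetical)
import torch

ckpt_path = "checkpoints/sota-16z.ckpt"  # hypothetical path; use the file you actually downloaded
ckpt = torch.load(ckpt_path, map_location="cpu")

# Checkpoints are usually dicts; print the top-level keys and a few parameter shapes.
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state_dict = ckpt.get("state_dict", ckpt)
    for name, value in list(state_dict.items())[:5]:
        if hasattr(value, "shape"):
            print(name, tuple(value.shape))
```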
To reconstruct videos and images using our VAE model, organize your data in the following structure:
Place your videos and optional captions in the `examples/videos/gt` directory.
```
examples/videos/
├── gt/
│   ├── video1.mp4
│   ├── video1.txt   # Optional caption
│   ├── video2.mp4
│   ├── video2.txt
│   └── ...
└── recon/
    └── (reconstructed videos will be saved here)
```
- Captions: For cross-modal reconstruction, include a `.txt` file with the same name as the video, containing its caption.
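Before running a text-guided reconstruction, it can help to check that every video has a matching caption file. This is an optional sketch using only the standard library; the `examples/videos/gt` path comes from the layout above:

```python
# check_captions.py -- verify that each video in examples/videos/gt has a same-named .txt caption
from pathlib import Path

gt_dir = Path("examples/videos/gt")
for video in sorted(gt_dir.glob("*.mp4")):
    caption = video.with_suffix(".txt")
    if caption.exists():
        print(f"{video.name}: caption found -> {caption.read_text().strip()[:60]}")
    else:
        # Missing captions are fine for plain reconstruction, but needed for text-guided runs.
        print(f"{video.name}: no caption (OK unless you use a *_cap config)")
```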
Place your images in the `examples/images/gt` directory.
```
examples/images/
├── gt/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
└── recon/
    └── (reconstructed images will be saved here)
```
- Note: The image dataset does not require captions.
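If you are starting from your own data, the expected `gt`/`recon` folders can be created up front so the reconstruction scripts have somewhere to write their outputs. This is just a convenience sketch using the standard library:

```python
# prepare_dirs.py -- create the expected gt/recon folders for images and videos if missing
from pathlib import Path

for root in ("examples/images", "examples/videos"):
    for sub in ("gt", "recon"):
        Path(root, sub).mkdir(parents=True, exist_ok=True)
        print(f"ready: {root}/{sub}")
```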
Our video VAE supports both image and video reconstruction.
Please ensure that the `ckpt_path` in all your configuration files is set to the actual path of your checkpoint.
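If you prefer to set the checkpoint path programmatically rather than editing the YAML by hand, a sketch like the following works, assuming `PyYAML` is available and that `ckpt_path` is a top-level key in the config (the real configs may nest it differently):

```python
# set_ckpt_path.py -- rewrite ckpt_path in an inference config (assumes a top-level ckpt_path key)
import yaml

config_file = "configs/inference/config_16z.yaml"
checkpoint = "/absolute/path/to/sota-16z.ckpt"  # hypothetical path; point this at your download

with open(config_file) as f:
    cfg = yaml.safe_load(f)

cfg["ckpt_path"] = checkpoint  # adjust the key if your config nests it deeper
with open(config_file, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

print(f"ckpt_path in {config_file} -> {checkpoint}")
```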
Run video reconstruction using:
```bash
bash scripts/run_inference_video.sh
```
This is equivalent to:
```bash
python inference_video.py \
    --data_root 'examples/videos/gt' \
    --out_root 'examples/videos/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --chunk_size 8 \
    --resolution 720 1280
```
- If the chunk size is too large, you may encounter memory issues. In this case, reduce the `chunk_size` parameter. Ensure that `chunk_size` is divisible by 4 (see the sketch after this list).
- To enable cross-modal reconstruction using captions, change `config_path` to `'configs/config_16z_cap.yaml'` for the 16-channel model with caption guidance.
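To see why `chunk_size` matters, here is a rough sketch of chunked inference over the time axis. It is illustrative only: the `encode`/`decode` functions are placeholders, not this repo's API, and the divisible-by-4 check reflects the requirement above (likely tied to the model's temporal downsampling, though the repo does not state this explicitly). The point is that peak memory scales with how many frames are processed at once:

```python
# chunked_inference_sketch.py -- illustrative only; encode/decode are hypothetical placeholders
import torch

def encode(chunk: torch.Tensor) -> torch.Tensor:
    # Placeholder for the VAE encoder; the real model maps frames to a compact latent.
    return chunk

def decode(latent: torch.Tensor) -> torch.Tensor:
    # Placeholder for the VAE decoder; the real model reconstructs frames from the latent.
    return latent

def reconstruct_in_chunks(frames: torch.Tensor, chunk_size: int = 8) -> torch.Tensor:
    """frames: (T, C, H, W). Process chunk_size frames at a time to bound peak memory."""
    assert chunk_size % 4 == 0, "chunk_size should be divisible by 4"
    outputs = []
    for start in range(0, frames.shape[0], chunk_size):
        chunk = frames[start:start + chunk_size]  # only this many frames are held at once
        outputs.append(decode(encode(chunk)))
    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    dummy = torch.randn(32, 3, 720, 1280)  # 32 frames at the example resolution
    print(reconstruct_in_chunks(dummy, chunk_size=8).shape)
```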
Run image reconstruction using:
```bash
bash scripts/run_inference_image.sh
```
This is equivalent to:
```bash
python inference_image.py \
    --data_root 'examples/images/gt' \
    --out_root 'examples/images/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --batch_size 1
```
- Note: The batch size is set to 1 because the images in the example folder have varying resolutions. If you have a batch of images with the same resolution, you can increase the batch size to accelerate inference (see the grouping sketch below).
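If your own images share resolutions, one simple way to find a safe larger batch size is to group them by size first. This is an optional sketch using Pillow, not part of the repo's scripts:

```python
# group_by_resolution.py -- optional helper sketch; groups images by (width, height) for batching
from collections import defaultdict
from pathlib import Path
from PIL import Image

groups = defaultdict(list)
for path in sorted(Path("examples/images/gt").iterdir()):
    if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
        with Image.open(path) as img:
            groups[img.size].append(path.name)

for (w, h), names in groups.items():
    # Images within one group share a resolution, so they can be batched together.
    print(f"{w}x{h}: {len(names)} images -> could use --batch_size {len(names)}")
```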
To start training, use the following command:
```bash
bash scripts/run_training.sh config_16z
```
This default command trains the 16-channel model with video reconstruction on a single GPU.
You can modify the training configuration by changing the config parameter:
- `config_4z`: 4-channel model
- `config_4z_joint`: 4-channel model trained jointly on both image and video data
- `config_4z_cap`: 4-channel model with text guidance
- `config_16z`: Default 16-channel model
- `config_16z_joint`: 16-channel model trained jointly on both image and video data
- `config_16z_cap`: 16-channel model with text guidance

Note: Do not include the `.yaml` extension when specifying the config.
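For example, to train the 16-channel model with text guidance instead of the default, point the same script at the caption config:

```bash
bash scripts/run_training.sh config_16z_cap
```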
The training data should be organized in a CSV file with the following format:
```
path,text
/absolute/path/to/video1.mp4,A person walking on the beach
/absolute/path/to/video2.mp4,A car driving down the road
```
- Use absolute paths for video files.
- Include two columns: `path` and `text`.
- For training without text guidance, leave the `text` column empty but keep the two-column CSV structure.
```
# With captions
/data/videos/clip1.mp4,A dog playing in the park
/data/videos/clip2.mp4,Sunset over the ocean

# Without captions
/data/videos/clip1.mp4,
/data/videos/clip2.mp4,
```
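A small helper like the following (an optional sketch, not part of the repo) can generate such a CSV from a folder of videos, pulling captions from same-named `.txt` files when they exist and leaving the `text` column empty otherwise:

```python
# make_train_csv.py -- optional sketch: build the training CSV from a folder of .mp4 files
import csv
from pathlib import Path

video_dir = Path("/data/videos")   # hypothetical location of your training clips
out_csv = "train_data.csv"         # hypothetical output name; point your training config at it

with open(out_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "text"])
    for video in sorted(video_dir.glob("*.mp4")):
        caption_file = video.with_suffix(".txt")
        caption = caption_file.read_text().strip() if caption_file.exists() else ""
        writer.writerow([str(video.resolve()), caption])  # absolute path; caption may be empty
```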
Use the provided scripts to evaluate reconstruction quality using PSNR, SSIM, and LPIPS metrics.
```bash
bash scripts/evaluation_image.sh
bash scripts/evaluation_video.sh
```
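If you want to compute the same metrics on an individual image pair outside the provided scripts, a rough sketch with `scikit-image` and the `lpips` package looks like this. The package choice, file names, and pre-processing are assumptions; the official evaluation scripts remain the reference:

```python
# metrics_sketch.py -- rough per-image PSNR / SSIM / LPIPS sketch (assumes scikit-image, lpips, Pillow)
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# File names under recon/ are assumed to mirror gt/; adjust if the script writes different names.
gt = np.array(Image.open("examples/images/gt/image1.jpg").convert("RGB"))
recon = np.array(Image.open("examples/images/recon/image1.jpg").convert("RGB"))

psnr = peak_signal_noise_ratio(gt, recon, data_range=255)
ssim = structural_similarity(gt, recon, channel_axis=-1, data_range=255)

# LPIPS expects NCHW tensors scaled to [-1, 1].
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1.0
lpips_model = lpips.LPIPS(net="alex")
lpips_score = lpips_model(to_tensor(gt), to_tensor(recon)).item()

print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}  LPIPS: {lpips_score:.4f}")
```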
This project is released under the CC BY-NC-ND license; please follow its terms.