(Project page demo: side-by-side comparisons of Ground Truth (GT) and Reconstructed videos.)
Yazhou Xing*, Yang Fei*, Yingqing He*†, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen† (*equal contribution, †corresponding author) The Hong Kong University of Science and Technology
A state-of-the-art Video Variational Autoencoder (VAE) designed for high-fidelity video reconstruction. This project leverages cross-modal and joint video-image training to enhance reconstruction quality.
- High-Fidelity Reconstruction: Achieve superior image and video reconstruction quality.
- Cross-Modal Reconstruction: Utilize captions to guide the reconstruction process.
- State-of-the-Art Performance: Set new benchmarks in video reconstruction tasks.
- [Jan 2025] 🏋️ Released training code and an improved pretrained 4z-text weight
- [Dec 2024] 🚀 Released inference code and pretrained models
- [Dec 2024] 📝 Released paper on arXiv
- [Dec 2024] 💡 Project page is live at VideoVAE+
- Release Pretrained Model Weights
- Release Inference Code
- Release Training Code
Follow these steps to set up your environment and run the code:
```bash
git clone https://github.com/VideoVerses/VideoVAEPlus.git
cd VideoVAEPlus
```
Create a Conda environment and install dependencies:
```bash
conda create --name vae python=3.10 -y
conda activate vae
pip install -r requirements.txt
```
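After installing the dependencies, a quick sanity check can confirm that PyTorch sees a CUDA GPU before you run inference or training. This is a minimal sketch; it only assumes `torch` is installed via `requirements.txt`:

```python
# check_env.py -- minimal environment sanity check (assumes torch is installed)
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Reconstructing long, high-resolution videos benefits from a GPU with ample memory.
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```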
| Model Name | Latent Channels | Download Link |
|---|---|---|
| sota-4z | 4 | Download |
| sota-4z-text | 4 | Download |
| sota-16z | 16 | Download |
| sota-16z-text | 16 | Download |
- Note: '4z' and '16z' indicate the number of latent channels in the VAE model. Models with 'text' support text guidance.
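If you want to verify a downloaded checkpoint before pointing a config at it, a minimal sketch like the one below loads it on CPU and prints its top-level keys. It assumes the weights are standard PyTorch checkpoint files; the filename and the exact key layout inside are assumptions, not guarantees:

```python
# inspect_ckpt.py -- peek inside a downloaded checkpoint (filename below is hypothetical)
import torch

ckpt_path = "checkpoints/sota-16z.ckpt"  # hypothetical path; use the file you actually downloaded
ckpt = torch.load(ckpt_path, map_location="cpu")

# Checkpoints are usually dicts; print the top-level keys and a few parameter shapes.
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state_dict = ckpt.get("state_dict", ckpt)
    for name, value in list(state_dict.items())[:5]:
        if hasattr(value, "shape"):
            print(name, tuple(value.shape))
```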
To reconstruct videos and images using our VAE model, organize your data in the following structure:
Place your videos and optional captions in the `examples/videos/gt` directory.
```
examples/videos/
├── gt/
│   ├── video1.mp4
│   ├── video1.txt   # Optional caption
│   ├── video2.mp4
│   ├── video2.txt
│   └── ...
└── recon/
    └── (reconstructed videos will be saved here)
```
- Captions: For cross-modal reconstruction, include a `.txt` file with the same name as the video, containing its caption.
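Before running a text-guided reconstruction, it can help to check that every video has a matching caption file. This is an optional sketch using only the standard library; the `examples/videos/gt` path comes from the layout above:

```python
# check_captions.py -- verify that each video in examples/videos/gt has a same-named .txt caption
from pathlib import Path

gt_dir = Path("examples/videos/gt")
for video in sorted(gt_dir.glob("*.mp4")):
    caption = video.with_suffix(".txt")
    if caption.exists():
        print(f"{video.name}: caption found -> {caption.read_text().strip()[:60]}")
    else:
        # Missing captions are fine for plain reconstruction, but needed for text-guided runs.
        print(f"{video.name}: no caption (OK unless you use a *_cap config)")
```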
Place your images in the `examples/images/gt` directory.
```
examples/images/
├── gt/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
└── recon/
    └── (reconstructed images will be saved here)
```
- Note: The image dataset does not require captions.
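If you are starting from your own data, the expected `gt`/`recon` folders can be created up front so the reconstruction scripts have somewhere to write their outputs. This is just a convenience sketch using the standard library:

```python
# prepare_dirs.py -- create the expected gt/recon folders for images and videos if missing
from pathlib import Path

for root in ("examples/images", "examples/videos"):
    for sub in ("gt", "recon"):
        Path(root, sub).mkdir(parents=True, exist_ok=True)
        print(f"ready: {root}/{sub}")
```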
Our video VAE supports both image and video reconstruction.
Please ensure that the `ckpt_path` in all your configuration files is set to the actual path of your checkpoint.
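If you prefer to set the checkpoint path programmatically rather than editing the YAML by hand, a sketch like the following works, assuming `PyYAML` is available and that `ckpt_path` is a top-level key in the config (the real configs may nest it differently):

```python
# set_ckpt_path.py -- rewrite ckpt_path in an inference config (assumes a top-level ckpt_path key)
import yaml

config_file = "configs/inference/config_16z.yaml"
checkpoint = "/absolute/path/to/sota-16z.ckpt"  # hypothetical path; point this at your download

with open(config_file) as f:
    cfg = yaml.safe_load(f)

cfg["ckpt_path"] = checkpoint  # adjust the key if your config nests it deeper
with open(config_file, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

print(f"ckpt_path in {config_file} -> {checkpoint}")
```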
Run video reconstruction using:
```bash
bash scripts/run_inference_video.sh
```
This is equivalent to:
```bash
python inference_video.py \
    --data_root 'examples/videos/gt' \
    --out_root 'examples/videos/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --chunk_size 8 \
    --resolution 720 1280
```
- If the chunk size is too large, you may encounter memory issues. In this case, reduce the `chunk_size` parameter. Ensure that `chunk_size` is divisible by 4 (see the sketch after this list).
- To enable cross-modal reconstruction using captions, change `config_path` to `'configs/config_16z_cap.yaml'` for the 16-channel model with caption guidance.
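To see why `chunk_size` matters, here is a rough sketch of chunked inference over the time axis. It is illustrative only: the `encode`/`decode` functions are placeholders, not this repo's API, and the divisible-by-4 check reflects the requirement above (likely tied to the model's temporal downsampling, though the repo does not state this explicitly). The point is that peak memory scales with how many frames are processed at once:

```python
# chunked_inference_sketch.py -- illustrative only; encode/decode are hypothetical placeholders
import torch

def encode(chunk: torch.Tensor) -> torch.Tensor:
    # Placeholder for the VAE encoder; the real model maps frames to a compact latent.
    return chunk

def decode(latent: torch.Tensor) -> torch.Tensor:
    # Placeholder for the VAE decoder; the real model reconstructs frames from the latent.
    return latent

def reconstruct_in_chunks(frames: torch.Tensor, chunk_size: int = 8) -> torch.Tensor:
    """frames: (T, C, H, W). Process chunk_size frames at a time to bound peak memory."""
    assert chunk_size % 4 == 0, "chunk_size should be divisible by 4"
    outputs = []
    for start in range(0, frames.shape[0], chunk_size):
        chunk = frames[start:start + chunk_size]  # only this many frames are held at once
        outputs.append(decode(encode(chunk)))
    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    dummy = torch.randn(32, 3, 720, 1280)  # 32 frames at the example resolution
    print(reconstruct_in_chunks(dummy, chunk_size=8).shape)
```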
Run image reconstruction using:
```bash
bash scripts/run_inference_image.sh
```
This is equivalent to:
```bash
python inference_image.py \
    --data_root 'examples/images/gt' \
    --out_root 'examples/images/recon' \
    --config_path 'configs/inference/config_16z.yaml' \
    --batch_size 1
```
- Note: The batch size is set to 1 because the images in the example folder have varying resolutions. If you have a batch of images with the same resolution, you can increase the batch size to accelerate inference (see the grouping sketch below).
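If your own images share resolutions, one simple way to find a safe larger batch size is to group them by size first. This is an optional sketch using Pillow, not part of the repo's scripts:

```python
# group_by_resolution.py -- optional helper sketch; groups images by (width, height) for batching
from collections import defaultdict
from pathlib import Path
from PIL import Image

groups = defaultdict(list)
for path in sorted(Path("examples/images/gt").iterdir()):
    if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
        with Image.open(path) as img:
            groups[img.size].append(path.name)

for (w, h), names in groups.items():
    # Images within one group share a resolution, so they can be batched together.
    print(f"{w}x{h}: {len(names)} images -> could use --batch_size {len(names)}")
```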
To start training, use the following command:
```bash
bash scripts/run_training.sh config_16z
```
This default command trains the 16-channel model with video reconstruction on a single GPU.
You can modify the training configuration by changing the config parameter:
- `config_4z`: 4-channel model
- `config_4z_joint`: 4-channel model trained jointly on both image and video data
- `config_4z_cap`: 4-channel model with text guidance
- `config_16z`: Default 16-channel model
- `config_16z_joint`: 16-channel model trained jointly on both image and video data
- `config_16z_cap`: 16-channel model with text guidance

Note: Do not include the `.yaml` extension when specifying the config.
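For example, to train the 16-channel model with text guidance instead of the default, point the same script at the caption config:

```bash
bash scripts/run_training.sh config_16z_cap
```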
The training data should be organized in a CSV file with the following format:
```
path,text
/absolute/path/to/video1.mp4,A person walking on the beach
/absolute/path/to/video2.mp4,A car driving down the road
```
- Use absolute paths for video files.
- Include two columns: `path` and `text`.
- For training without text guidance, leave the `text` column empty but keep the two-column CSV structure.
```
# With captions
/data/videos/clip1.mp4,A dog playing in the park
/data/videos/clip2.mp4,Sunset over the ocean

# Without captions
/data/videos/clip1.mp4,
/data/videos/clip2.mp4,
```
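A small helper like the following (an optional sketch, not part of the repo) can generate such a CSV from a folder of videos, pulling captions from same-named `.txt` files when they exist and leaving the `text` column empty otherwise:

```python
# make_train_csv.py -- optional sketch: build the training CSV from a folder of .mp4 files
import csv
from pathlib import Path

video_dir = Path("/data/videos")   # hypothetical location of your training clips
out_csv = "train_data.csv"         # hypothetical output name; point your training config at it

with open(out_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "text"])
    for video in sorted(video_dir.glob("*.mp4")):
        caption_file = video.with_suffix(".txt")
        caption = caption_file.read_text().strip() if caption_file.exists() else ""
        writer.writerow([str(video.resolve()), caption])  # absolute path; caption may be empty
```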
Use the provided scripts to evaluate reconstruction quality using PSNR, SSIM, and LPIPS metrics.
```bash
bash scripts/evaluation_image.sh
bash scripts/evaluation_video.sh
```
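If you want to compute the same metrics on an individual image pair outside the provided scripts, a rough sketch with `scikit-image` and the `lpips` package looks like this. The package choice, file names, and pre-processing are assumptions; the official evaluation scripts remain the reference:

```python
# metrics_sketch.py -- rough per-image PSNR / SSIM / LPIPS sketch (assumes scikit-image, lpips, Pillow)
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# File names under recon/ are assumed to mirror gt/; adjust if the script writes different names.
gt = np.array(Image.open("examples/images/gt/image1.jpg").convert("RGB"))
recon = np.array(Image.open("examples/images/recon/image1.jpg").convert("RGB"))

psnr = peak_signal_noise_ratio(gt, recon, data_range=255)
ssim = structural_similarity(gt, recon, channel_axis=-1, data_range=255)

# LPIPS expects NCHW tensors scaled to [-1, 1].
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1.0
lpips_model = lpips.LPIPS(net="alex")
lpips_score = lpips_model(to_tensor(gt), to_tensor(recon)).item()

print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}  LPIPS: {lpips_score:.4f}")
```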
This project is released under the CC BY-NC-ND license; please follow its terms.