
Questions #55

Open
radna0 opened this issue Jul 13, 2024 · 1 comment

Comments


radna0 commented Jul 13, 2024

How does EasyAnimate slice a 1080p video? More specifically, at what frame interval does the slicing happen? I assume the published memory requirements are for resolutions lower than 1080p.

Estimate for 144 frames at 1920x1080: 64-80 GB?

  • Is it possible to further lower the memory usage of the model? What is the bottleneck here: the VAE or the DiT? Can we quantize them?
  • Is it possible to run the model on multiple GPUs? Have you implemented something like device_map from accelerate for model parallelism? (A rough sketch of the kind of setup I mean is at the end of this comment.)
  • In the Open-Sora-Plan v1.1 technical report, they had to reduce the number of 3D convolutions to handle longer videos during DiT training, which meant they also had to train the encoder (but not the decoder). Why does EasyAnimate not need to unfreeze the encoder and still train normally?
  • CV-VAE uses SD2.1's VAE, which has a z=4 latent, and they report losing fine details; they plan to train on SD3's VAE, which has z=16, to solve this. Does EasyAnimate suffer from the same problem? How does it solve it?
  • For video captioning, what about using dense captions? For example, the ShareCaptioner model does a very good job at dense video captioning. Assuming AdaLN is only viable for a fixed set of classes, but you are using cross-attention to condition on text, shouldn't dense captions help in this case?
  • Also, since the VAE slices the video frames to encode them, is it possible to do frame interpolation? Image-to-video works; is middle/end-frame extension also possible? Or even connecting different videos?


For context: I want to train the model, or reuse parts of the architecture, on animation data.
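
To make the memory and multi-GPU questions concrete, here is the kind of setup I mean. This is only a sketch assuming a diffusers-style pipeline; the checkpoint path and the pipeline class are placeholders, and I do not know whether EasyAnimate's own pipeline exposes these helpers.

```python
# Hypothetical sketch, not EasyAnimate's actual API: fp16 weights plus
# offloading and VAE slicing on a diffusers-style video pipeline.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "path/to/easyanimate-checkpoint",  # placeholder checkpoint path
    torch_dtype=torch.float16,         # fp16 weights roughly halve memory
    # Newer diffusers versions also accept device_map="balanced" here to
    # spread submodules across multiple GPUs.
)

# Move submodules to GPU only while they run (trades speed for memory);
# requires accelerate and may not be wired up for every pipeline.
pipe.enable_model_cpu_offload()

# Encode/decode the video in slices rather than all frames at once,
# if the pipeline exposes this helper.
if hasattr(pipe, "enable_vae_slicing"):
    pipe.enable_vae_slicing()

frames = pipe(prompt="an animated character running through a forest").frames
```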

@yunkchen (Collaborator) commented:

  • First, we recommend reading our paper on arXiv, especially the "Slice VAE" section (a toy sketch of the temporal-slicing idea follows after this list).
  • Our current higher priority is to improve the quality of our generated videos: consistency, action continuity, prompt control, etc. Inference efficiency will be worked on afterwards.
  • We haven't tried model parallelism ourselves, but it is definitely feasible.
  • We first trained our own Slice VAE.
  • We have tried increasing the number of latent channels, but the parameter count grows significantly; this might be a way to improve performance.
  • An accurate and comprehensive dense captioner is very important. We experimented with many and eventually trained our own model.
  • This task should be feasible, but we haven't trained it yet.
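
The following is not our actual implementation (the paper's "Slice VAE" section describes that), just a toy illustration of the basic idea: encode a long video in fixed-size temporal chunks so peak activation memory is bounded by the slice length rather than the full frame count. The slice length and the `vae_encode` callable below are placeholders.

```python
import torch


def encode_in_temporal_slices(vae_encode, video, slice_len=8):
    """Encode a (B, C, T, H, W) video in chunks of `slice_len` frames.

    `vae_encode` is any callable mapping (B, C, t, H, W) -> latents and
    stands in for a 3D VAE encoder. Peak memory scales with `slice_len`
    instead of the full frame count T.
    """
    latents = []
    for start in range(0, video.shape[2], slice_len):
        chunk = video[:, :, start : start + slice_len]
        with torch.no_grad():
            latents.append(vae_encode(chunk))
    # Stitch the per-slice latents back together along the temporal axis.
    return torch.cat(latents, dim=2)


# Toy usage: a fake "encoder" that average-pools 4x8x8 blocks into one latent.
fake_encode = lambda x: torch.nn.functional.avg_pool3d(x, (4, 8, 8))
video = torch.randn(1, 3, 32, 64, 64)  # small toy clip, 32 frames
z = encode_in_temporal_slices(fake_encode, video, slice_len=8)
print(z.shape)  # torch.Size([1, 3, 8, 8, 8])
```

A real slice VAE also has to handle context across slice boundaries (for example with overlap or causal convolutions), which this toy version ignores.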
