Scaling Up Longer Video Generation Model Training Without More GPUs

darth_vader_final.gif harley_final.gif

The gifs were manually resized to 256x256 and heavily lossy-compressed with gifsicle -O3 --lossy=200 --resize 256x256 so the blog loads faster; the model was originally trained at 512x512 and generates more than 181 frames

However, these gifs are still more than 1MB each, so if you have trouble loading them, you may need to download them from the GitHub blog repo yourself; I can't compress the gifs any further

The dataset and model heavily resemble real-life humans with personal identity, such as faces and bodies, so they cannot be open-sourced due to legal concerns

TL;DR

I used my RTX 3090 Ti to build a dataset of 24370 clips and trained a model, under the 24GB VRAM limitation, that is capable of generating hundreds of frames with some consistency to the first frame. However, during this experiment I changed every possible thing mid-training, so there is no solid proof of what I learned beyond "it more or less works this way"

Scaling up the dataset

Last time I hand-crafted a walking-on-stage video dataset containing 2848 clips, and I trained on the first 65 frames of each clip

That is bigger than the much earlier 286-clip timelapse video dataset, but still too small for a real challenge

So I gathered a human dancing dataset from various internet sources, containing 24370 video clips with 181 frames each

It covers the most difficult subjects for image and video generation: humans and rapid motion

  1. The clips are aligned using pose detection and resized to 512x512
  2. Each clip has at most 2 alternative augmentations, so there are more than 24370 effective clips during training (a rough sketch follows this list)
  3. There are some “bad” clips with heavy camera motion, or where the human runs out of frame
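Here is a minimal sketch of what a clip record looks like in spirit; the class, field names, and augmentation labels below are made up for illustration and are not my actual pipeline:

```python
import random
import torch
from torch.utils.data import Dataset

class DanceClipDataset(Dataset):
    """Illustrative only: each clip ships with up to 2 alternative
    augmentations, so the effective dataset is larger than 24370 clips."""

    def __init__(self, clip_records):
        # clip_records: list of dicts, e.g.
        # {"frames": "clips/clip_00001.pt", "augmentations": ["flip"]}
        self.clip_records = clip_records

    def __len__(self):
        return len(self.clip_records)

    def __getitem__(self, index):
        record = self.clip_records[index]
        frames = torch.load(record["frames"])      # (181, 3, 512, 512), pose-aligned
        if record["augmentations"] and random.random() < 0.5:
            frames = self.apply(frames, random.choice(record["augmentations"]))
        return frames

    def apply(self, frames, name):
        # stand-in for the real augmentation ops
        if name == "flip":
            frames = torch.flip(frames, dims=[-1])  # horizontal flip
        return frames
```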

Scaling up video duration by interpolation and extrapolation

Last time I did video interpolation over the whole clip, in two interpolation stages: 5 frames -> 17 frames -> 65 frames

And I used local attention to cut down the computational requirements

Although that worked, generating 65 frames already consumed 24GB of VRAM, even with accelerate/deepspeed optimizations and gradient checkpointing

To generate as many as 181 frames, I decided to train the model in an autoregressive way:

  1. a base model that generates 4 frames and, with some hacky inference technique, can generate 7 frames; I call it the “starter model”
  2. an extrapolation model that generates 3 new frames from the previous frames, but with a stride of 4 (every 4th frame)
  3. an interpolation model that fills in around the 3 newly generated frames, for a total of 9 frames (filling 3 frames into each of the two gaps)

The frame numbers are generated in the following way (newly generated frames end with a !):

  • 1, 2!, 3!, 4!, 5!, 6!, 7!
  • 1, 2, 3, 7, 11!, 15!, 19!
  • 1, 2, 6, 7, 8!, 9!, 10!, 11 …

I know it's vague; don't take it too seriously. It is a rough hack of my own and does not work too well, for now
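To make it slightly less vague, here is a toy sketch of how the frame indices for the extrapolate-then-interpolate loop could be laid out. Only indices are computed, no models are called, and my actual loop structure may differ:

```python
# Toy sketch of the schedule above: starter -> skipper (stride 4) -> filler.
def frame_schedule(target_frames: int = 19, stride: int = 4, skip_count: int = 3):
    frames = list(range(1, 8))          # starter: 1, 2!, 3!, 4!, 5!, 6!, 7!
    while frames[-1] < target_frames:
        last = frames[-1]
        # skipper: extrapolate new frames at every 4th position, e.g. 11!, 15!, 19!
        coarse = [last + stride * i for i in range(1, skip_count + 1)]
        prev = last
        for nxt in coarse:
            # filler: interpolate the skipped positions, e.g. 8!, 9!, 10!
            frames.extend(range(prev + 1, nxt))
            frames.append(nxt)
            prev = nxt
    return frames

print(frame_schedule())   # [1, 2, ..., 19] after the first loop; keep looping for longer clips
```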

The good thing is that with this method, I don't need gradient checkpointing or CPU offloading, which speeds up training further, not to mention that a 7-frame step iterates far quicker than a 65-frame one

However, when handling a dataset this large, I need to speed things up further, and not only on the training side

Dataset hack & cheating

Well, if you are doing academic research, don't do anything like this

I was inspired by https://arxiv.org/abs/2206.07137, as the title suggests:

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

I decided to train the model on the full dataset first for a few epochs (the training never really finishes anyway), then determine which data points are worth learning

I don't want to go into detail; changing the dataset itself mid-training is a forbidden move, because the result of the training becomes unreproducible and fragile

However, since I never finish training or write papers, this is not a problem for me

I deleted the 10% of the dataset that seemed hardest to learn in the first starter stage

And also 10% in the extrapolation skipper stage, but not the same 10% (yeah, maybe it would have been better to use the same split)

The training loss dropped like crazy, and the test generations have been improving faster ever since
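The bookkeeping behind that deletion is roughly the following; the clip ids, the loss tracking, and the function name are illustrative, not my actual code:

```python
# Track an average training loss per clip over the first few epochs, then drop
# the 10% the model finds hardest to learn and rebuild the dataset from the rest.
def prune_hardest_clips(per_clip_loss: dict, drop_fraction: float = 0.10) -> list:
    ranked = sorted(per_clip_loss, key=per_clip_loss.get)   # easiest -> hardest
    keep = int(len(ranked) * (1.0 - drop_fraction))
    return ranked[:keep]                                    # clip ids to keep

# e.g. per_clip_loss = {"clip_00001": 0.12, "clip_00002": 0.47, ...}
# kept_ids = prune_hardest_clips(per_clip_loss)
```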

However, I can't give any proof of what I felt; this experiment is completely unrigorous, for the following reasons

Noise augmentation to keep long-term consistency

When doing autoregressive generation, error accumulates every iteration. Say we generate 120 frames with my method: that takes around 10 loops, each loop using previously generated content as the hint, and that hint can be very inaccurate

So at first I applied noise at a signal-to-noise ratio of 0.33 to the generated hint frames (but not the first hint frame); it seemed good when testing with a very small number of test clips

Then I found it wasn't enough, so I changed the augmentation noise from a signal-to-noise ratio of 0.33 to 0.66. It got better, I felt better; no proof of any kind, however …

And this means I changed the augmentation mid-training; I would be fired if I were a scientist LOL
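For reference, the augmentation mixes the generated hint frames with Gaussian noise before feeding them back in. The parameterization below is my loose reconstruction of what the 0.33 -> 0.66 change means, so treat the exact formula and the function name as assumptions:

```python
import torch

# Corrupt previously generated hint frames with Gaussian noise so the model
# learns to tolerate its own accumulated errors. The first hint frame is kept
# clean and never passed through this function.
# I read the ratio here as "how much noise gets mixed in" (0.33, later 0.66);
# my actual training code may use a different convention.
def noise_augment_hints(hint_frames: torch.Tensor, noise_ratio: float = 0.66) -> torch.Tensor:
    noise = torch.randn_like(hint_frames)
    return (1.0 - noise_ratio) * hint_frames + noise_ratio * noise
```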

Half-way fix of half-way attention

When I coded the first version of this experiment, I used a “half-way” attention that splits the sequence in half and combines the two halves after the calculation

It yields a maximum error of 0.06 and an average error of 0.01 compared to attending over the whole sequence; I thought that was acceptable, and much better than running out of VRAM and doing nothing
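The hack looked roughly like this (a simplified sketch; my original implementation differs in details and names):

```python
import math
import torch

def attend(q, k, v):
    # plain softmax attention over whatever sequence it is given
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return scores.softmax(dim=-1) @ v

# "Half-way" attention: attend within each half of the sequence independently
# and concatenate, instead of attending over the whole sequence at once.
# Cheaper in memory, but not exactly equivalent to full attention.
def halfway_attention(q, k, v):
    mid = q.shape[-2] // 2
    first = attend(q[..., :mid, :], k[..., :mid, :], v[..., :mid, :])
    second = attend(q[..., mid:, :], k[..., mid:, :], v[..., mid:, :])
    return torch.cat([first, second], dim=-2)
```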

But then I forgot about it and never reverted the half-way attention hack. When I realized this, I decided to switch back to “correct as a whole” attention mid-training

Okay, that is to say, I changed the model structure mid-training; this is not good, very not good, but necessary

Power failure, and forgetting to dump the Adam optimizer state

Em… yeah, I forgot to dump the Adam optimizer state at first, and then my apartment had a power failure mid-training

So the training did not have to restart from the beginning, but the training loss went crazy for days before it started making sense again
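The fix is simple: dump the optimizer state with every checkpoint, roughly like this (plain PyTorch shown here; accelerate also ships its own state-saving helpers that do the same job):

```python
import torch

# Save the optimizer state alongside the model weights so a resume restores
# Adam's moment estimates instead of re-warming them over days of training.
def save_checkpoint(path, model, optimizer, step):
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),   # Adam moments live in here
        },
        path,
    )

def load_checkpoint(path, model, optimizer):
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["step"]
```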

What I learned from the experiment

So much for the confession; despite all the bad things I hacked and fixed, I actually learned the following

  1. Always dump optimizer states when training with Adam or its relatives
  2. A hack can be helpful when testing at first; if you forget about it when scaling up, it can be a disaster
  3. Noise augmentation is very cool, but determining how much noise to add is a total pain in the (beep)
  4. Autoregressive generation is good: it saves VRAM and time, and if you code it right, it will crash later rather than sooner
  5. I realized I have to redo the experiment with a smarter generation schedule to make sure the quality won't drop significantly over time, not to mention fixing everything I did wrong

Not Really a Conclusion

I changed the model structure, the augmentation, the dataset, and the optimizer state mid-training; these are unforgivable mistakes that should be avoided, but …

At least it works, barely works, but it works

And hey, it's under 24GB of VRAM, and capable of generating hundreds of frames

I am so eager to share with everyone what I did well, but currently the quality is poor; that is to say, I am not doing well for now

So in its current state, claiming that the model works would be a false claim, and sharing non-working code would be irresponsible, so I won't update my GitHub repo this time, but hopefully not for long

Limitations

  1. Every time the generated figure tries to turn their head left or right, it creates artifacts; Stable Diffusion v1.5 cannot handle these cases well
  2. The generated figure tends to become female in the autoregressive pipeline, due to dataset bias
  3. Although in theory it can generate clips of unlimited length, rapid human actions always reach a state where the generation breaks, such as getting too far from or too close to the camera, etc.
  4. If the generated figure is not moving fast, the background shows overfitting

Citations

Thanks to the open-source repos made by https://github.com/lucidrains, including but not limited to:

https://github.com/lucidrains/make-a-video-pytorch

https://github.com/lucidrains/video-diffusion-pytorch

And my code is based on https://github.com/huggingface/diffusers; in particular, most of the speed-up tricks are bundled within the original repository

@misc{mindermann2022prioritized,
      title={Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt}, 
      author={Sören Mindermann and Jan Brauner and Muhammed Razzak and Mrinank Sharma and Andreas Kirsch and Winnie Xu and Benedikt Höltgen and Aidan N. Gomez and Adrien Morisot and Sebastian Farquhar and Yarin Gal},
      year={2022},
      eprint={2206.07137},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
@misc{Singer2022,
      title={Make-A-Video: Text-to-Video Generation without Text-Video Data},
      author={Uriel Singer and others},
      year={2022},
      url={https://makeavideo.studio/Make-A-Video.pdf}
}
@misc{von-platen-etal-2022-diffusers,
  author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
  title = {Diffusers: State-of-the-art diffusion models},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/diffusers}}
}