Make A Longer Stable Diffusion Video On Home Computers
The gifs are manually resized to 256x256 and heavily compressed (lossy) using gifsicle -O3 --lossy=200 --resize 256x256 for faster blog loading; the model was originally trained at 512x512 with 64 frames
However, these gifs are still more than 1MB each, so if you have trouble loading them, you may need to download them from the GitHub blog repo yourself; I can't compress the gifs any further
The dataset and model heavily resemble real-life humans, with personal identity information such as faces and bodies, and thus cannot be open-sourced due to legal concerns
Longer Problems For Longer Videos
- It simply won't work if you just set frames_length to a higher value and press Enter harder with your finger
- My previous timelapse toy model used the timelapse video dataset; although those clips are long enough for a longer-video experiment, it doesn't make sense to train longer sequences when shorter ones are already good enough
- As is tradition, one single RTX 3090 Ti (24 GB VRAM) is all I've got, and I mean to get all the fancy longer-video generation stuff done within the exact same home computer computation limits
The Missing Make-A-Video Technique
Well, I already gave out the make-a-stable-diffusion-video GitHub repo to demonstrate how to make it work, especially on home computers
And I stated in my last blog post: 'Oh, I cannot afford that with 24 GB of VRAM, let's just pretend there isn't a whole paragraph in the Make-A-Video paper explaining video frame interpolation'
Now I'm going to try more frames and I wish to get coherent results over a long range of frames, and preferably to finish training myself instead of leaving a letter to my grandson
Video frame interpolation is what I need, the missing piece
https://paperswithcode.com/task/video-frame-interpolation
So I made a hack in my code, using the inpainting model's special conditioning input to implement a fast but not quite correct interpolation:
# keep every 4th frame as the hint (reference) frame
hint_latents = latents[:,:,0::4,:,:]
# repeat each hint 4 times along the frame axis, then trim back to frames_length
hint_latents_expand = hint_latents.repeat_interleave(4,2)
hint_latents_expand = hint_latents_expand[:,:,:args.frames_length,:,:]
# concatenate noisy latents, inpainting masks and hint latents along the channel axis
latent_model_input = torch.cat([noisy_latents, masks_input, hint_latents_expand], dim=1).to(accelerator.device)
Well, for every 4th frame, the original frame is set as the inpainting condition input so the model reproduces that exact frame image, since I masked nothing
The good thing is that this hack is almost one line, with no custom attention module modification; the bad thing is that it is mathematically wrong, because I really should keep the static frames fixed instead of running model backbone inference on them
And as for duplicating the hint input to every subordinate frame, I didn't do research on its effects; I can't answer what it does because I have no clue myself
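To make that duplication concrete, here is a tiny standalone example (the latent shapes are made up purely for illustration) that prints which hint frame each output frame ends up conditioned on:

import torch

# toy latents in the same layout as the training code: (batch, channels, frames, height, width)
frames_length = 17
latents = torch.randn(1, 4, frames_length, 8, 8)

# keep every 4th frame as the hint: original frame indices 0, 4, 8, 12, 16
hint_latents = latents[:, :, 0::4, :, :]                               # (1, 4, 5, 8, 8)
# each hint is repeated 4 times along the frame axis, then trimmed to frames_length
hint_latents_expand = hint_latents.repeat_interleave(4, 2)             # (1, 4, 20, 8, 8)
hint_latents_expand = hint_latents_expand[:, :, :frames_length, :, :]  # (1, 4, 17, 8, 8)

for t in range(frames_length):
    print(f"frame {t:2d} is conditioned on hint frame {(t // 4) * 4}")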
So, here is the plan (the frame-count math is sketched right after the list):
- generate 5 frames
- interpolate to 17 frames, (5-1)x4+1=17
- interpolate to 65 frames, (17-1)x4+1=65
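Spelled out as code, the plan is just this frame-count recurrence (expanded_length is a name invented only for this sketch):

def expanded_length(n: int) -> int:
    # one interpolation pass keeps the n key frames and fills 3 new frames
    # into each of the (n - 1) gaps: n + 3 * (n - 1) = (n - 1) * 4 + 1
    return (n - 1) * 4 + 1

n = 5
for stage in (1, 2):
    n = expanded_length(n)
    print(f"after interpolation stage {stage}: {n} frames")
# after interpolation stage 1: 17 frames
# after interpolation stage 2: 65 frames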
Attention Is All I Can Not Afford
Surely I wouldn't run into much trouble dealing with 5 frames, whatever attention I use
But 65 frames leads to a huge problem, especially when we do interpolation
Normal attention has O(n^2) complexity
For 5 frames, n is 5, so that would be 5^2=25 units of complexity. For 65 frames, n is 65, so that would be 65^2=4225 units of complexity
So it is obvious that training the 65-frame model costs 169x as much attention computation as the 5-frame model; this could be a problem
I actually tried that for comparison: the 65-frame model was a total mess, both visually and by loss curve, even after 20 epochs of training. I should have saved the screenshot, but I got too frustrated and forgot
Here comes a better(?) attention mechanism for long sequences
https://paperswithcode.com/method/sliding-window-attention
With sliding window attention such as local attention, you only need O(n x w) complexity; for 65 frames and a window size of 5, that's 65x5=325 units of complexity, which compared to 4225 almost seems like a silver bullet
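Putting the numbers side by side:

full_5 = 5 ** 2      # 25 units for the 5-frame model
full_65 = 65 ** 2    # 4225 units for the naive 65-frame model
local_65 = 65 * 5    # 325 units with local attention and window size 5

print(full_65 // full_5)    # 169: how much more the naive 65-frame model costs
print(full_65 // local_65)  # 13: the saving local attention promises at 65 frames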
But does it?
By using https://github.com/facebookresearch/xformers/blob/main/xformers/components/attention/local.py
I found the shortcoming of local attention: it is too narrow-minded, only looking at adjacent frames, and the most notable effect is that the identity of the main subject changes rapidly across frames
The good thing is that this shortcoming barely matters for our interpolation network: we always have a reference frame every 4 frames, so there is no need to worry about identity change
Note: I have no idea which window size is best, I just used 5, which is the default value
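For intuition only, here is a minimal pure-PyTorch sketch of the sliding window attention pattern. This is not the xformers implementation linked above, and it still materializes the full score matrix; a real implementation only computes the in-window pairs to actually get the O(n x w) cost:

import torch

def local_attention_mask(n_frames: int, window_size: int) -> torch.Tensor:
    # True where frame i may attend to frame j, i.e. |i - j| fits inside the window
    idx = torch.arange(n_frames)
    return (idx[None, :] - idx[:, None]).abs() <= window_size // 2

def masked_attention(q, k, v, mask):
    # q, k, v: (n_frames, dim); mask: (n_frames, n_frames) bool
    scores = (q @ k.t()) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

n_frames, window_size, dim = 65, 5, 64
q = k = v = torch.randn(n_frames, dim)
out = masked_attention(q, k, v, local_attention_mask(n_frames, window_size))
print(out.shape)  # torch.Size([65, 64])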
It turned out that by using local attention, I was able to train the 65-frame interpolation network for 20 epochs, and it seems to be good enough (my fancy way of saying I ran out of patience and stopped)
A Not Too Short And Not Too Small Dataset For Testing
Many would ask: what's wrong with the timelapse video dataset? It's long enough and you already have it, so why bother making another dataset?
- the timelapse video dataset is too small (286 videos); it does not generalize well enough for creative art, and a cat standing idle with clouds moving is almost the only thing it does
- stable diffusion is good at generating landscape images, and human eyes are not sensitive to errors in nature scenes; natural landscapes change very little over time, so a timelapse dataset can cover up hidden technical problems, but this time I shall face the real challenge
So, this time I made a 'fashion model walking on stage' video dataset, which has the following features:
- 2848 videos, all above 100 frames, almost 10x the size of the timelapse video dataset
- contains humans, a forbidden area for stable diffusion, both for poor generation quality and for legal obligations
- the human subject changes scale over time, walking from far to near, rather than staying the same size
Speaking of legal obligations, I am not able to open-source the dataset or the model: because it is trained on humans, it more or less has to contain personal identity information such as faces and bodies
All I can publish is where I got the raw videos:
https://www.youtube.com/@yaersfashiontv
And I used Blender to preprocess the videos, manually
Unverified Experimental Hacks
Besides what I did and tested above, I also experimented with custom attention patterns, like always attending to the first frame no matter the window size
But I cannot tell the difference, so without proof I can only say I am unable to confirm whether they work or not
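For the record, that pattern is just the local mask from the sketch above with its first column forced open, something like this (purely illustrative):

import torch

def local_mask_always_first(n_frames: int, window_size: int) -> torch.Tensor:
    idx = torch.arange(n_frames)
    mask = (idx[None, :] - idx[:, None]).abs() <= window_size // 2
    mask[:, 0] = True  # every frame may additionally attend to frame 0
    return mask

print(local_mask_always_first(8, 5).int())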
Recommended Opensource Implementations
I have noticed that there is a cleaner and easier-to-use implementation than my make-a-stable-diffusion-video repo
If you have enough VRAM and don't wish to use my hacks (which mainly focus on running under 24 GB of VRAM), you can check this work-in-progress implementation by chavinlo
https://github.com/chavinlo/TempoFunk
Citations
Thanks to the open-source repos made by https://github.com/lucidrains, including but not limited to:
https://github.com/lucidrains/make-a-video-pytorch
https://github.com/lucidrains/video-diffusion-pytorch
And my code is based on https://github.com/huggingface/diffusers; in particular, most of the speed-up tricks are already bundled within the original repository
@misc{Singer2022,
title = {Make-A-Video: Text-to-Video Generation without Text-Video Data},
author = {Uriel Singer},
year = {2022},
url = {https://makeavideo.studio/Make-A-Video.pdf}
}
@misc{ho2022video,
title = {Video Diffusion Models},
author = {Jonathan Ho and Tim Salimans and Alexey Gritsenko and William Chan and Mohammad Norouzi and David J. Fleet},
year = {2022},
eprint = {2204.03458},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
@misc{rombach2021highresolution,
title={High-Resolution Image Synthesis with Latent Diffusion Models},
author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
year={2021},
eprint={2112.10752},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{von-platen-etal-2022-diffusers,
author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
title = {Diffusers: State-of-the-art diffusion models},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/diffusers}}
}