Improve Short Video Consistency With Stable Diffusion
Stable Diffusion ships with a built-in example for img2img generation, so we can easily adapt it for vid2vid; however, it does not seem good enough at keeping the video frames consistent and smooth
In case you have doubts: I already used fixed noise and fixed seed(s) for all frames, so now we can focus on the obvious problems
The GIF above is resized and compressed for faster page loading; it is not the original length and quality
- One problem is that if we select a noising strength that is too low, as in the top-right corner (--strength 0.45), the model only makes trivial edits, which accomplishes nothing except adding jumping artifacts across frames
- Another problem is that if we select a higher noising strength, as in the bottom-left corner (--strength 0.75), the model ignores the obvious object across frames and makes the car disappear, and I still do not find it artistic enough
Here I adopt an idea from the paper Deep Video Prior for Video Consistency and Propagation, and make it like the bottom-right corner, achieving better video consistency for short videos
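For reference, the naive per-frame baseline above looks roughly like the sketch below, written against the diffusers library (the model id, the paths and the fixed seed 42 are illustrative placeholders, not necessarily what I actually ran):

```python
from pathlib import Path

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a abstract painting of a cyberpunk city night, tron robotic, trending on artstation"
out_dir = Path("edited_frames")
out_dir.mkdir(exist_ok=True)

for frame_path in sorted(Path("frames").glob("*.png")):
    frame = Image.open(frame_path).convert("RGB").resize((512, 512))
    # re-seed for every frame so the sampling noise is identical across frames
    generator = torch.Generator("cuda").manual_seed(42)
    edited = pipe(
        prompt=prompt,
        image=frame,
        strength=0.45,       # the trade-off knob discussed above
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    edited.save(out_dir / frame_path.name)
```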
Not the old content-style balance problem
If you remember the old neural style transfer work, you may recall something called the content-style balance: a magic ratio that had to be tuned manually to find a good trade-off between content fidelity and style
Here we have a ‘noising strength’ parameter: set it to 0.01 and you get almost exactly the original content; set it to 0.99 and you get pure imagination from the prompt; could there be a satisfying value somewhere in the middle?
Well, I could not find one, and even with my hack in place the video is still somewhat jumpy; the improvement is limited
You have to increase content fidelity by using a lower noising strength to keep the video frames consistent, but how are you going to make notable text-prompt edits at such a low noising strength?
Now we have a problem to solve
Short video as the unconditional dataset
We want the stable diffusion model to generate video frames that follow the reference video; to some degree, we do not want to generate anything that is far from all of the frames
So we can finetune the stable diffusion model to reconstruct the video frames better when given no text prompt, and then use text prompts to edit them
A fun fact: after many experiments, I found that 30 frames are enough to handle a 300-frame short video; there is no real need to finetune on all of them, unless your video has sudden subject twists
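Assembling that unconditional set is simple; a minimal sketch, assuming the clip is already split into frame images (paths and counts are illustrative):

```python
from pathlib import Path

all_frames = sorted(Path("frames").glob("*.png"))               # e.g. ~300 frames
step = max(1, len(all_frames) // 30)
unconditional_set = [(p, "") for p in all_frames[::step][:30]]  # (frame, empty prompt) pairs
```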
Text to image as the conditional dataset
Select a frame as an example and run img2img with your text prompt until you are satisfied, using a rather large noising strength; don't worry yet that the content may be inconsistent with the original frame, we have more steps further down (see the sketch after the notes below)
- It is okay if the edited frame has obviously changed a lot in color space, for example a black shirt becoming a red dress; you may use (--strength 0.75) or even more
- It is NOT okay if the subject's composition has changed too much, for example if a human arm position moves a lot; generate more images and select the closest one, or decrease the noising strength, otherwise the frames will get jumpy
- Remember the text prompt
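A hedged sketch of this step, again using the diffusers img2img pipeline as a stand-in for whatever script you prefer (the key frame path and the number of candidates are arbitrary): generate several candidates at the higher strength and keep the ones whose composition stays closest to the original frame.

```python
from pathlib import Path

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a abstract painting of a cyberpunk city night, tron robotic, trending on artstation"
key_frame = Image.open("frames/frame_0000.png").convert("RGB").resize((512, 512))

out_dir = Path("candidates")
out_dir.mkdir(exist_ok=True)

for seed in range(8):  # generate a handful, then pick by eye
    generator = torch.Generator("cuda").manual_seed(seed)
    candidate = pipe(prompt=prompt, image=key_frame, strength=0.75,
                     guidance_scale=7.5, generator=generator).images[0]
    candidate.save(out_dir / f"candidate_{seed:02d}.png")

# keep one or two picks as (edited image, prompt) pairs, each remembering
# which original frame it was generated from
```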
Finetune on the combined dataset
Now we have an unconditional dataset consisting of 30 frames with empty text embeddings, and a conditional dataset consisting of maybe 2 different text prompts on 1 frame
So we have a dataset of 32 samples in total
Let's resume training on stable diffusion as finetuning; if you have not read my previous post about how to finetune the model, now is the time to go read it
I have also employed some techniques I discovered earlier, including finetuning only on late steps to speed up training
I also make sure the text-conditional examples start denoising from their paired original frames as the starting point
Due to the small number of frames (32 in this case), the whole process takes only a few hours for a single video, but the output quality still needs to be improved
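For concreteness, here is a rough sketch of what the finetuning step could look like with diffusers-style components (unet, vae, text_encoder, tokenizer and noise_scheduler are assumed to be already loaded from a stable diffusion checkpoint). This is not my exact training script: the "start denoising from the paired original frame" detail is not reproduced here, and reading "late steps" as the low-noise end of the schedule (t < 500, the timesteps img2img actually visits at moderate strength) is an assumption.

```python
import torch
import torch.nn.functional as F

# unet, vae, text_encoder, tokenizer, noise_scheduler: components of a loaded
# stable diffusion checkpoint (e.g. via diffusers), omitted here for brevity
LATE_STEPS = 500   # assumed meaning of "late steps 500": only train on t < 500
LR = 1e-5

optimizer = torch.optim.AdamW(unet.parameters(), lr=LR)

def training_step(images, captions):
    # images: (B, 3, 512, 512) tensors scaled to [-1, 1]
    # captions: "" for the 30 raw frames, the edit prompt for the conditional pairs
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    # restrict timesteps to the late (low-noise) part of the schedule
    t = torch.randint(0, LATE_STEPS, (latents.shape[0],), device=latents.device)
    noisy = noise_scheduler.add_noise(latents, noise, t)
    tokens = tokenizer(list(captions), padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt").input_ids.to(latents.device)
    text_emb = text_encoder(tokens)[0]
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)   # standard epsilon-prediction objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```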
Further details
The original video is from the YouTube channel https://www.youtube.com/channel/UCBcVQr-07MH-p9e2kRTdB3A by J Utah, cropped to 10 seconds (from 1.5 hours) and resized to 512x512
The text prompt is “a abstract painting of a cyberpunk city night, tron robotic, trending on artstation”
Strength parameters, clockwise: original, 0.45, 0.45, 0.75 (after finetuning you can lower the strength parameter to get more fidelity; I use 0.45 for comparison; for human subjects 0.325 is good enough)
Finetuned for 1000 iterations (human subjects only need around 400 iterations), learning rate 1e-5, late steps 500
Generation uses 50 steps (lazy, I know)
Used Blender on Linux to combine the images into videos
Citations
@inproceedings{lei2020dvp,
  title     = {Blind Video Temporal Consistency via Deep Video Prior},
  author    = {Lei, Chenyang and Xing, Yazhou and Chen, Qifeng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2020}
}

@article{DVP_lei,
  author  = {Chenyang Lei and Yazhou Xing and Hao Ouyang and Qifeng Chen},
  title   = {Deep Video Prior for Video Consistency and Propagation},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {To Appear}
}

@misc{rombach2021highresolution,
  title         = {High-Resolution Image Synthesis with Latent Diffusion Models},
  author        = {Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
  year          = {2021},
  eprint        = {2112.10752},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}