Subject-driven video generation is a challenging task: the model must generate customized videos of a specific subject while preserving that subject's identity in the context of a given prompt. Our method achieves this by leveraging an image customization dataset, without using any paired video customization dataset.
By disentangling identity and motion, our framework learns subject identity from image customization data and motion from unpaired video data, achieving strong subject consistency and scalability without requiring video customization data.
Our contributions are as follows:
- Thanks to our stochastic-switching finetuning strategy, our method generates high-quality videos with high scores across all of the Motion Smoothness, Dynamic Degree, CLIP, and DINO metrics. Details can be found in our paper.
- Our stochastic-switching finetuning strategy yields better quality than either two-stage finetuning (i.e., image-only finetuning followed by image-to-video finetuning) or image-only finetuning alone; a sketch of the strategy follows this list.
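To make the strategy concrete, here is a minimal sketch of stochastic-switching finetuning, assuming a PyTorch-style model and optimizer. The loader names, the switching probability `p_image`, and the two loss methods are hypothetical placeholders, not the actual implementation; see the paper for the real objectives and schedule.

```python
# Minimal sketch of stochastic-switching finetuning. All names here
# (image_loader, video_loader, p_image, the two loss methods) are
# illustrative placeholders, not the paper's actual implementation.
import random
from itertools import cycle

def finetune(model, optimizer, image_loader, video_loader,
             num_steps, p_image=0.5):
    """At each step, stochastically pick one objective instead of
    running the two objectives as sequential stages."""
    image_iter = cycle(image_loader)  # identity: image customization data
    video_iter = cycle(video_loader)  # motion: unpaired video data
    for _ in range(num_steps):
        optimizer.zero_grad()
        if random.random() < p_image:
            # Identity objective on an image customization batch.
            loss = model.image_customization_loss(next(image_iter))
        else:
            # Motion objective on an unpaired video batch.
            loss = model.video_generation_loss(next(video_iter))
        loss.backward()
        optimizer.step()
```

Intuitively, interleaving the two objectives throughout training lets them regularize each other, whereas sequential two-stage finetuning risks the second stage overwriting what the first stage learned.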
@article{kim2025subject,
  author  = {Kim, Daneul and Zhang, Jingxu and Jin, Wonjoon and Cho, Sunghyun and Dai, Qi and Park, Jaesik and Luo, Chong},
  title   = {Subject-driven Video Generation via Disentangled Identity and Motion},
  journal = {arXiv},
  year    = {2025},
}