Subject-driven video generation is a challenging task: the model must generate customized videos of a specific subject while preserving the subject's identity in the context of a given prompt. Our method achieves this by leveraging an image customization dataset, without using any paired video customization dataset.
Our framework thus enables subject-driven video customization without video customization data, combining image customization data with unpaired video data to achieve strong subject consistency and scalability.
Our contributions are as follows:
- A subject-driven video customization framework that requires no paired video customization data, learning subject identity from image customization data and motion from unpaired videos.
- A stochastic-switching finetuning strategy that disentangles identity and motion learning, outperforming two-stage and image-only finetuning.
- Strong results: high Motion Smoothness, Dynamic Degree, CLIP, and DINO scores, with motion dynamics comparable to the original CogVideoX under the FloVD Temporal Evaluation Protocol.
Thanks to our stochastic-switching finetuning strategy, our method generates high-quality videos, scoring highly on all four metrics: Motion Smoothness, Dynamic Degree, CLIP, and DINO.
Details can be found in our Paper-To-Be-Released.
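For reference, below is a minimal sketch (not the authors' evaluation code) of how CLIP and DINO image-similarity scores are commonly computed for subject-driven generation: cosine similarity between features of the reference subject image and each generated frame. The checkpoint names are common public choices, assumed for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ViTImageProcessor, ViTModel

# Public checkpoints assumed for illustration; the paper may use different ones.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vits16").eval()
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_image_score(ref: Image.Image, frames: list[Image.Image]) -> float:
    # Mean cosine similarity between the reference image and each generated frame.
    inputs = clip_proc(images=[ref] + frames, return_tensors="pt")
    feats = F.normalize(clip.get_image_features(**inputs), dim=-1)
    return (feats[1:] @ feats[0]).mean().item()

@torch.no_grad()
def dino_score(ref: Image.Image, frames: list[Image.Image]) -> float:
    # Same similarity, but with DINO ViT features (CLS token).
    inputs = dino_proc(images=[ref] + frames, return_tensors="pt")
    feats = F.normalize(dino(**inputs).last_hidden_state[:, 0], dim=-1)
    return (feats[1:] @ feats[0]).mean().item()
```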
Our stochastic-switching finetuning strategy yields better quality than either two-stage finetuning (i.e., image-only finetuning followed by image-to-video finetuning) or image-only finetuning alone.
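To make the strategy concrete, here is a hypothetical sketch of stochastic-switching finetuning: at every optimization step the model sees either an image-customization batch (identity signal) or an unpaired-video batch (motion signal), rather than two sequential stages. The loss method names and the switching probability are placeholders, not the paper's actual settings.

```python
import random
import itertools

def stochastic_switch_finetune(model, optimizer,
                               image_customization_loader,
                               unpaired_video_loader,
                               num_steps: int,
                               p_image: float = 0.5):  # assumed probability, not the paper's value
    image_iter = itertools.cycle(image_customization_loader)
    video_iter = itertools.cycle(unpaired_video_loader)
    for _ in range(num_steps):
        if random.random() < p_image:
            # Identity objective on image customization data (placeholder name).
            loss = model.image_customization_loss(next(image_iter))
        else:
            # Motion objective on unpaired video data (placeholder name).
            loss = model.video_denoising_loss(next(video_iter))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Interleaving the two objectives keeps both identity and motion signals present throughout training, which is one intuition for why it avoids the degradation seen with sequential finetuning.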
We further evaluate our method using the FloVD Temporal Evaluation Protocol, which measures degradation in dynamic video generation via the FVD score, and we achieve motion dynamics comparable to the original CogVideoX.
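For orientation, FVD compares Gaussian statistics of features extracted from real and generated videos (typically with an I3D network in the standard protocol). Below is a minimal sketch of the Fréchet-distance step, assuming the video features have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    # real_feats / gen_feats: (num_videos, feature_dim) arrays of video
    # features, e.g., I3D embeddings; lower distance is better.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical imaginary residue from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```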
@article{kim2025subject,
  author  = {Kim, Daneul and Zhang, Jingxu and Jin, Wonjoon and Cho, Sunghyun and Dai, Qi and Park, Jaesik and Luo, Chong},
  title   = {Subject-driven Video Generation via Disentangled Identity and Motion},
  journal = {arXiv},
  year    = {2025},
}