Subject-driven Video Generation via Disentangled Identity and Motion

arXiv 2025

1Seoul National University, 2Microsoft Research Asia, 3POSTECH

Subject-driven Video Generation

Subject-driven video generation is a challenging task that requires the model to generate customized videos of a specific subject while maintaining the subject's identity in the context of a given prompt. Our method achieves this by leveraging an image customization dataset, without requiring a paired video customization dataset.

Our Approach

Our framework enables subject-driven video customization without requiring video customization data by leveraging image customization data and unpaired video data to achieve strong subject consistency and scalability.

Framework diagram showing ID injection and Temporal-aware preservation optimization
We alternately optimize the ID injection and temporal-aware preservation objectives.

Our contributions are as follows:

  • We propose a novel framework that enables subject-driven video customization without requiring video customization data.
  • We introduce a stochastic-switching strategy in finetuning to improve the quality of generated videos.
  • We demonstrate the effectiveness of our approach through extensive experiments and comparisons with existing methods.
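The stochastic-switching strategy above can be pictured as follows: at each finetuning step, one of the two objectives (ID injection on image customization data, or temporal-aware preservation on unpaired video data) is sampled at random, rather than running two sequential finetuning stages. The sketch below is a minimal illustration under that reading; the loss functions, batch contents, and switching probability `p_id` are illustrative assumptions, not the authors' actual implementation.

```python
import random

def id_injection_loss(image_batch):
    # Placeholder for an identity-preservation loss computed on
    # image customization data (here just an average of dummy values).
    return sum(image_batch) / len(image_batch)

def temporal_preservation_loss(video_batch):
    # Placeholder for a temporal-aware preservation loss computed on
    # unpaired video data, which keeps the motion prior intact.
    return sum(video_batch) / len(video_batch)

def training_step(image_batch, video_batch, p_id=0.5, rng=random):
    """Stochastically switch between the two objectives for one step."""
    if rng.random() < p_id:
        return "id_injection", id_injection_loss(image_batch)
    return "temporal_preservation", temporal_preservation_loss(video_batch)

if __name__ == "__main__":
    rng = random.Random(0)
    picks = [training_step([1.0, 2.0], [3.0, 4.0], rng=rng)[0]
             for _ in range(1000)]
    # With p_id=0.5, each objective is chosen on roughly half the steps.
    print(picks.count("id_injection"))
```

Because both objectives are interleaved throughout finetuning, neither the subject identity nor the motion prior is overwritten, which a two-stage schedule risks.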

Comparison with VideoBooth, OminiControl, BLIP, IP-Adapter+I2V, and Vidu

Thanks to our stochastic-switching strategy in finetuning, our method generates high-quality videos with high scores across all metrics: Motion Smoothness, Dynamic Degree, CLIP, and DINO.

Details can be found in our paper (to be released).

Quantitative results chart comparing different methods

Comparison with Still-Moving

Ablation Study

Ablation in Training Strategy

Our stochastic-switching strategy in finetuning yields better quality than either two-stage finetuning (i.e., image-only finetuning followed by image-to-video finetuning) or image-only finetuning alone.

We further evaluate our method using the FloVD Temporal Evaluation Protocol to assess degradation in dynamic video generation in terms of FVD score, and we achieve motion dynamics performance comparable to the original CogVideoX.

Details can be found in our paper (to be released).

Ablation study table showing different training strategies and their results

Ablation in Temporal-aware Preservation Training

BibTeX

@article{kim2025subject,
  author    = {Kim, Daneul and Zhang, Jingxu and Jin, Wonjoon and Cho, Sunghyun and Dai, Qi and Park, Jaesik and Luo, Chong},
  title     = {Subject-driven Video Generation via Disentangled Identity and Motion},
  journal   = {arXiv},
  year      = {2025},
}