Subject-driven video generation (SDV-Gen) adapts a pretrained video model to produce videos of a specific subject. Per-subject fine-tuning is expensive, and prior zero-shot methods avoid test-time tuning but typically require millions of subject-video pairs and 10K to 200K A100 GPU hours of training.
We instead learn a single zero-shot model from 200K subject-image pairs and 4,000 arbitrary videos, using no subject-video pairs at all. With CogVideoX-5B, training takes 288 A100 GPU hours, roughly 1% of the compute used by prior zero-shot baselines.
The method decomposes learning into identity injection from subject images and motion-awareness preservation from arbitrary videos, optimized with stochastic switching, random reference-frame sampling, and image-token dropout. The same recipe also transfers to Wan 2.2-5B.
Key points:
- Zero-shot SDV-Gen is framed as a dual-task learning problem: identity injection learns subject appearance from subject-image pairs, while motion-awareness preservation maintains video dynamics from arbitrary videos.
- Stochastic switching alternates the identity and motion objectives within a single training run. During motion updates, random reference-frame sampling and image-token dropout reduce first-frame copying (see the sketch after this list).
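A minimal sketch of what such a stochastic-switching loop could look like, assuming a diffusion-style video backbone with an image-token conditioning path. All names here (`ToyVideoModel`, `encode_image_tokens`, `denoising_loss`, and the switching and dropout probabilities) are illustrative stand-ins, not the paper's actual code:

```python
import random

import torch
import torch.nn as nn

P_IDENTITY = 0.5       # assumed probability of taking an identity-injection step
P_TOKEN_DROPOUT = 0.3  # assumed probability of dropping the reference image tokens


class ToyVideoModel(nn.Module):
    """Stand-in for a pretrained video model (e.g. CogVideoX-5B) with an
    image-token conditioning path; the real backbone is far larger."""

    def __init__(self, dim=16):
        super().__init__()
        self.img_proj = nn.Conv2d(3, dim, kernel_size=4, stride=4)  # image -> token grid
        self.backbone = nn.Linear(dim, dim)

    def encode_image_tokens(self, image):
        # (B, 3, H, W) -> (B, N, dim) reference-image tokens
        feats = self.img_proj(image)
        return feats.flatten(2).transpose(1, 2)

    def denoising_loss(self, target, cond):
        # Toy surrogate for a denoising objective: predict pooled target
        # statistics from the pooled conditioning tokens.
        pred = self.backbone(cond.mean(dim=1))             # (B, dim)
        tgt = target.flatten(1).mean(dim=1, keepdim=True)  # (B, 1) crude summary
        return ((pred - tgt) ** 2).mean()


def training_step(model, identity_batch, video_batch, optimizer):
    """One stochastic-switching update: identity injection or motion preservation."""
    optimizer.zero_grad()
    if random.random() < P_IDENTITY:
        # Identity injection: condition on the subject image and
        # reconstruct its paired target image.
        cond = model.encode_image_tokens(identity_batch["subject_image"])
        loss = model.denoising_loss(identity_batch["target_image"], cond)
    else:
        # Motion-awareness preservation on an arbitrary video clip.
        frames = video_batch["frames"]             # (B, T, 3, H, W)
        t_ref = random.randrange(frames.shape[1])  # random reference frame, not always frame 0
        cond = model.encode_image_tokens(frames[:, t_ref])
        if random.random() < P_TOKEN_DROPOUT:
            cond = torch.zeros_like(cond)          # image-token dropout: denoise unconditionally
        loss = model.denoising_loss(frames, cond)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = ToyVideoModel()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    identity_batch = {"subject_image": torch.randn(2, 3, 32, 32),
                      "target_image": torch.randn(2, 3, 32, 32)}
    video_batch = {"frames": torch.randn(2, 8, 3, 32, 32)}
    for _ in range(4):
        print(f"loss: {training_step(model, identity_batch, video_batch, opt):.4f}")
```

Sampling the reference frame uniformly over time rather than always using frame 0, and occasionally zeroing the reference tokens, both push the model to treat the reference as an identity cue rather than a frame to copy.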
Citation:
@article{kim2025learning,
  author  = {Kim, Daneul and Zhang, Jingxu and Jin, Wonjoon and Cho, Sunghyun and Dai, Qi and Park, Jaesik and Luo, Chong},
  title   = {Learning Zero-Shot Subject-Driven Video Generation Using 1\% Compute},
  journal = {arXiv preprint arXiv:2504.17816},
  year    = {2025},
}