Subject-driven video generation (S2V) requires preserving a subject's identity across diverse motions and scenes. Existing tuning-free methods achieve zero-shot generalization but demand prohibitive training resources: millions of subject-video pairs and 10K to 200K A100 GPU hours.
We present a dual-task framework that decouples S2V into (i) identity injection from subject-to-image (S2I) pairs and (ii) temporal-awareness preservation from unpaired proxy videos. Our method fine-tunes CogVideoX-5B using only 200K S2I pairs and 4,000 proxy videos (~1% of Pexels-400K) in just 288 A100 GPU hours, which is 35 to 730× faster than tuning-free baselines, while requiring no per-subject fine-tuning at inference.
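For concreteness, the sketch below shows one way such a stochastic dual-task loop can be arranged: at each step a batch is drawn from either the S2I stream or the proxy-video stream, and only the corresponding loss is optimized. The backbone, loss functions, and data here are toy placeholders (`identity_loss`, `temporal_loss`, random tensors), not the actual CogVideoX-5B training code.

```python
# Minimal sketch of a stochastic dual-task training loop (illustrative only).
# The model, losses, and batches are stand-ins for the real S2I pairs and
# unpaired proxy-video latents.
import random
import torch
import torch.nn as nn

model = nn.Linear(64, 64)  # placeholder for the video diffusion backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def identity_loss(model, batch):
    # S2I branch: hypothetical identity-injection objective on image pairs.
    return (model(batch) - batch).pow(2).mean()

def temporal_loss(model, batch):
    # Proxy-video branch: hypothetical objective that preserves motion priors.
    return (model(batch) - batch.roll(1, dims=0)).pow(2).mean()

p_identity = 0.5  # probability of drawing an S2I step at each iteration
for step in range(100):
    optimizer.zero_grad()
    batch = torch.randn(8, 64)  # stands in for a sampled S2I or proxy-video batch
    if random.random() < p_identity:
        loss = identity_loss(model, batch)
    else:
        loss = temporal_loss(model, batch)
    loss.backward()
    optimizer.step()
```

Because each step updates the model with exactly one of the two objectives, neither data stream needs to be paired with the other.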
A key insight is that the identity and temporal gradients exhibit emergent near-orthogonality under our stochastic dual-task training. This allows simultaneous optimization of both objectives without catastrophic interference: the model learns subject identity without forgetting how to generate natural motion dynamics.
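One way to probe this near-orthogonality is to compute each task's loss on its own batch, flatten the resulting gradients, and measure their cosine similarity; values near zero indicate the two objectives update largely independent parameter directions. The snippet below is an illustrative, self-contained version using the same kind of toy placeholders as above, not the paper's measurement code.

```python
# Illustrative probe of gradient near-orthogonality between the two objectives.
# The model and loss expressions are toy stand-ins; a cosine near 0 would
# indicate the identity and temporal gradients are close to orthogonal.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(64, 64)  # placeholder backbone

def flat_grad(loss, params):
    # Flatten per-parameter gradients of `loss` into a single vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

params = [p for p in model.parameters() if p.requires_grad]
x = torch.randn(8, 64)
g_id = flat_grad((model(x) - x).pow(2).mean(), params)               # identity-style loss
g_tmp = flat_grad((model(x) - x.roll(1, dims=0)).pow(2).mean(), params)  # temporal-style loss

cos = F.cosine_similarity(g_id, g_tmp, dim=0)
print(f"cosine(identity grad, temporal grad) = {cos.item():.3f}")
```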
Data Efficiency
Our method also achieves comparable video quality with only 4,000 S2I pairs, just 2% of the full 200K-pair S2I dataset.
We provide the comparison videos below.