Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute

arXiv:2504.17816v3

¹Seoul National University, ²POSTECH, ³Microsoft Research Asia
§Work done during internship at Microsoft Research Asia.

Overview

Subject-driven video generation (SDV-Gen) adapts a pretrained video model to produce videos of a specific subject. Per-subject tuning is expensive, while prior zero-shot methods avoid test-time tuning but usually require millions of subject-video pairs and 10K to 200K A100 GPU hours.

We learn a single zero-shot model from 200K subject-image pairs and 4,000 arbitrary videos. With CogVideoX-5B, training takes 288 A100 GPU hours, about 1% of the compute used by prior zero-shot baselines, without any subject-video pairs.

The method decomposes learning into identity injection from subject images and motion-awareness preservation from arbitrary videos, optimized with stochastic switching, random reference-frame sampling, and image-token dropout. The same recipe also transfers to Wan 2.2-5B.

Our Approach

We view zero-shot SDV-Gen as a dual-task learning problem: identity injection learns subject appearance from subject-image pairs, and motion-awareness preservation maintains video dynamics from arbitrary videos.

Stochastic switching alternates the identity and motion objectives in one training run. During motion updates, random reference-frame sampling and image-token dropout reduce first-frame copying.
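The scheduling described above can be sketched as follows. This is an illustrative toy, not the released training code; the switching probability, dropout rate, and all function names are assumptions.

```python
import random

# Hypothetical constants -- the paper's actual values may differ.
P_IDENTITY = 0.5   # probability of taking an identity-injection step (assumed)
DROPOUT_P = 0.3    # image-token dropout rate during motion updates (assumed)

def sample_step(num_frames, num_image_tokens, rng=random):
    """Choose the objective for one update, plus its auxiliary sampling.

    Stochastic switching picks between the two objectives; on motion
    updates, a random reference frame and a token-dropout mask are drawn
    so the model cannot rely on copying the first frame verbatim.
    """
    if rng.random() < P_IDENTITY:
        # Identity injection: train on a (subject image, target) pair.
        return {"task": "identity", "ref_frame": None, "kept_tokens": None}
    # Motion-awareness preservation: train on an arbitrary video.
    ref = rng.randrange(num_frames)  # reference frame is not always frame 0
    kept = [t for t in range(num_image_tokens)
            if rng.random() >= DROPOUT_P]  # drop a random subset of image tokens
    return {"task": "motion", "ref_frame": ref, "kept_tokens": kept}
```

One training run interleaves both step types, so a single model learns subject identity without forgetting how to move.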

Figure: framework diagram of the identity-injection and motion-awareness-preservation branches. We stochastically switch between identity injection and motion-awareness preservation during training.

Key points:

  • No test-time per-subject tuning and no large-scale subject-video pairs.
  • 200K subject-image pairs plus 4,000 arbitrary videos adapt CogVideoX-5B in 288 A100 GPU hours.
  • Gradient analysis shows the identity and motion objectives rapidly move toward near-orthogonal update subspaces.
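The near-orthogonality claim can be quantified with a cosine similarity between the two objectives' gradient vectors. The sketch below uses toy stand-in vectors, not gradients from the paper's model; a cosine near zero indicates the updates occupy near-orthogonal subspaces and therefore interfere little.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for the two objectives' gradients (illustrative only).
g_identity = [1.0, 0.0, 0.2]
g_motion = [0.0, 1.0, -0.2]

# |cosine| close to 0 => near-orthogonal update directions.
print(abs(cosine(g_identity, g_motion)) < 0.1)  # prints True
```

In practice one would flatten and concatenate the per-parameter gradients of each objective before computing this similarity.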

BibTeX

@article{kim2025learning,
  author    = {Kim, Daneul and Zhang, Jingxu and Jin, Wonjoon and Cho, Sunghyun and Dai, Qi and Park, Jaesik and Luo, Chong},
  title     = {Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute},
  journal   = {arXiv preprint arXiv:2504.17816},
  year      = {2025},
}