Detailed benchmark results for subject-driven video generation
We evaluate our method on two standard benchmarks: VBench for video generation quality and the Open-S2V Benchmark for subject-driven video generation. Our evaluation set comprises 30 reference images drawn from state-of-the-art image customization papers and the DreamBooth dataset, each paired with 4 GPT-generated prompts, for a total of 120 evaluation videos.
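As a rough illustration of this protocol, the sketch below pairs each reference image with its prompts; the file layout, names, and helper are illustrative assumptions, not the released evaluation code.

```python
# Hypothetical layout: one PNG per subject plus a JSON file mapping image name -> 4 prompts.
from pathlib import Path
import json

def build_eval_set(image_dir: str, prompt_file: str):
    """Pair every reference image with its 4 GPT-generated prompts."""
    with open(prompt_file) as f:
        prompts = json.load(f)  # e.g. {"corgi.png": ["prompt 1", ..., "prompt 4"], ...}

    cases = []
    for image_path in sorted(Path(image_dir).glob("*.png")):
        for prompt in prompts[image_path.name]:
            cases.append({"reference": str(image_path), "prompt": prompt})
    return cases

cases = build_eval_set("reference_images/", "gpt_prompts.json")
assert len(cases) == 30 * 4  # 120 evaluation videos in total
```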
- Motion Smoothness: measures temporal consistency across frames. Higher values indicate smoother, more coherent motion without jittering or flickering.
- Dynamic Degree: quantifies the amount of motion in generated videos. Higher values indicate more dynamic, non-static video content.
- CLIP-T: measures how well the generated video aligns with the input text prompt using CLIP embeddings.
- CLIP-I / DINO-I: measure subject fidelity using CLIP and DINO embeddings, respectively. Higher DINO-I indicates better preservation of fine-grained subject details.
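A minimal sketch of how the CLIP-based metrics can be computed is shown below; the specific CLIP checkpoint, frame sampling, and averaging scheme are assumptions rather than the exact evaluation code, and DINO-I follows the same recipe with DINO ViT features in place of CLIP.

```python
# Assumes `frames` is a list of PIL images sampled from the generated video.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t(frames, prompt):
    """CLIP-T: mean cosine similarity between the text prompt and each sampled frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

@torch.no_grad()
def clip_i(frames, reference_image):
    """CLIP-I: mean cosine similarity between frame embeddings and the reference image.
    DINO-I is computed analogously with DINO ViT features."""
    inputs = processor(images=list(frames) + [reference_image], return_tensors="pt")
    emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    frame_emb, ref = emb[:-1], emb[-1:]
    return (frame_emb @ ref.T).mean().item()
```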
We compare against two categories of baselines: (1) Custom T2V methods that directly train text-to-video models on subject-video pairs (Phantom, VACE, VideoBooth), and (2) Custom T2I + I2V methods that combine subject-driven image generation with image-to-video models (OmniControl+I2V, BLIP+I2V, IP-Adapter+I2V).
Note that, for a fair comparison, we use the publicly available Wan-2.1 1.3B-based versions of Phantom and VACE.
| Method | Dataset | A100 h | Motion Smooth.↑ | Dynamic Deg.↑ | CLIP-T↑ | CLIP-I↑ | DINO-I↑ |
|---|---|---|---|---|---|---|---|
| Custom T2V Methods | |||||||
| Phantom | 1M pairs | ~10K | 98.93 | 54.90 | 33.51 | 72.49 | 51.94 |
| VACE | 53M videos | ~70K | 98.68 | 40.00 | 33.60 | 73.35 | 52.68 |
| VideoBooth | 30K videos | - | 96.95 | 51.67 | 29.59 | 66.06 | 34.54 |
| Custom T2I + I2V Methods | |||||||
| OmniControl + I2V | - | - | 98.21 | 51.67 | 31.89 | 72.58 | 54.16 |
| BLIP + I2V | - | - | 97.53 | 49.17 | 28.19 | 79.29 | 56.58 |
| IP-Adapter + I2V | - | - | 97.21 | 55.83 | 26.97 | 73.86 | 45.18 |
| Ours | 200K S2I + 4K videos | 288 | 98.45 | 69.64 | 32.69 | 77.14 | 62.88 |
We evaluate on the Open-S2V Benchmark (single-domain track) without any per-subject tuning or domain-specific adaptation, using the same checkpoint trained on 200K S2I pairs and 4K unpaired proxy videos. This benchmark provides a comprehensive evaluation including aesthetics, motion quality, face similarity, and overall generation quality.
| Method | Training Cost | Total↑ | Aesthetics↑ | Motion↑ | FaceSim↑ | GmeScore↑ | NexusScore↑ | NaturalScore↑ |
|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | ||||||||
| Vidu2.0 | - | 48.67% | 34.78% | 24.40% | 36.20% | 65.56% | 45.20% | 72.60% |
| Pika2.1 | - | 48.93% | 38.64% | 31.90% | 32.94% | 62.19% | 47.34% | 70.60% |
| Kling1.6 | - | 53.12% | 35.63% | 36.40% | 39.26% | 61.99% | 48.24% | 81.40% |
| VACE Series | ||||||||
| VACE-P1.3B | ≈70K h | 44.28% | 42.58% | 18.00% | 18.02% | 65.93% | 36.26% | 76.00% |
| VACE-1.3B | ≈70K h | 47.33% | 41.81% | 33.78% | 22.38% | 65.35% | 38.52% | 76.00% |
| VACE-14B | ≈210K h | 58.00% | 41.30% | 35.54% | 64.65% | 58.55% | 51.33% | 77.33% |
| Phantom Series | ||||||||
| Phantom-1.3B | ≈10K h | 49.95% | 42.98% | 19.30% | 44.03% | 65.61% | 37.78% | 76.00% |
| Phantom-14B | ≈30K h | 53.17% | 47.46% | 41.55% | 51.82% | 70.07% | 35.35% | 69.35% |
| Other Methods | ||||||||
| SkyReels-A2-P14B | - | 51.64% | 33.83% | 21.60% | 54.42% | 61.93% | 48.63% | 70.60% |
| HunyuanCustom | - | 51.64% | 34.08% | 26.83% | 55.93% | 54.31% | 50.75% | 68.66% |
| Ours | 288 h | 50.05% | 45.40% | 19.38% | 18.05% | 70.53% | 41.23% | 68.52% |
Efficiency Highlight: our 288 A100-hours are roughly 35 to 104x cheaper than Phantom-1.3B/14B (~10K to 30K hours) and 240 to 730x cheaper than VACE-1.3B/14B (~70K to 210K hours), yet yield a Total Score comparable to several baselines.
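These ratios follow directly from the A100-hour figures reported in the tables above:

```python
# Cost ratios relative to our 288 A100-hour budget (values taken from the tables above).
ours = 288
baselines = {"Phantom-1.3B": 10_000, "Phantom-14B": 30_000,
             "VACE-1.3B": 70_000, "VACE-14B": 210_000}
for name, hours in baselines.items():
    print(f"{name}: {hours / ours:.0f}x more A100 hours")
# Phantom-1.3B: 35x, Phantom-14B: 104x, VACE-1.3B: 243x, VACE-14B: 729x
```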
A key advantage of our approach is the dramatic reduction in training cost while maintaining competitive performance. The table below summarizes the data and compute requirements across different S2V methods.
| Method | Dataset Size | Base Model | A100 Hours | Domain |
|---|---|---|---|---|
| Per-subject Tuned Methods (require fine-tuning for each subject) | ||||
| CustomCrafter | 200 reg. images / subj. | VideoCrafter2 (1.4B) | ~200 / subj. | Object |
| Still-Moving | few images + 40 videos | Lumiere (1.2B) | - | Face/Object |
| Tuning-free Methods (zero-shot generalization) | ||||
| VACE | ~53M videos | LTX & Wan (14B) | 70K to 210K | General |
| Phantom | 1M subject-video pairs | Wan (1.3 to 14B) | 10K to 30K | Face/Object |
| Consis-ID | 130K clips | CogVideoX (5B) | - | Face |
| Ours | 200K S2I + 4K videos | CogVideoX (5B) | 288 | Object |
Our method achieves 35 to 730x lower training cost than tuning-free baselines while requiring no per-subject fine-tuning at inference. This efficiency comes from our dual-task formulation that leverages abundant S2I pairs for identity injection while using only a small set of unpaired proxy videos (~1% of Pexels-400K) for temporal awareness preservation.
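A minimal sketch of this dual-task data mixing is given below, assuming a simple probabilistic choice between the two tasks per training step; the sampling ratio, field names, and dataset format are illustrative assumptions, not the released training code.

```python
# Hypothetical dual-task sampler: S2I pairs for identity injection, unpaired proxy
# videos for temporal-awareness preservation. The 0.9/0.1 split is an assumption.
import random

def sample_batch(s2i_pairs, proxy_videos, p_image_task=0.9):
    """Draw either an S2I training example or an unpaired proxy-video example."""
    if random.random() < p_image_task:
        ref_image, target_image, caption = random.choice(s2i_pairs)
        return {"task": "s2i", "reference": ref_image,
                "target": target_image, "caption": caption}
    video_frames, caption = random.choice(proxy_videos)
    return {"task": "t2v", "frames": video_frames, "caption": caption}
```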