
Quantitative Evaluation

Detailed benchmark results for subject-driven video generation

We evaluate our method on two standard benchmarks: VBench for video generation quality and the Open-S2V Benchmark for subject-driven video generation. Our evaluation uses 30 reference images drawn from state-of-the-art image customization papers and the DreamBooth dataset, each paired with 4 GPT-generated prompts, for a total of 120 evaluation videos.


Evaluation Metrics

Motion Smoothness

Measures temporal consistency across frames. Higher values indicate smoother, more coherent motion without jittering or flickering.

Dynamic Degree

Quantifies the amount of motion in generated videos. Higher values indicate more dynamic, non-static video content.
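
For intuition, the sketch below gives simplified proxies for the two motion metrics above. The official VBench metrics are computed with learned motion models (e.g., optical-flow estimates for dynamic degree), so the plain frame-difference statistics here are illustrative assumptions meant only to convey what each score rewards, not the benchmark implementation.

```python
# Illustrative proxies for the two motion metrics (assumptions, not the
# official VBench code, which relies on learned motion models).
import numpy as np

def motion_smoothness_proxy(frames: np.ndarray) -> float:
    """frames: (T, H, W, C) array in [0, 1].
    Penalizes frame-to-frame acceleration, i.e. jitter and flicker."""
    velocity = np.diff(frames, axis=0)          # first-order change between frames
    jitter = np.abs(np.diff(velocity, axis=0))  # second-order change = jitter
    return float(1.0 - jitter.mean())           # higher = smoother

def dynamic_degree_proxy(frames: np.ndarray) -> float:
    """Average magnitude of inter-frame change; higher = more motion.
    (VBench estimates this from optical flow rather than raw differences.)"""
    return float(np.abs(np.diff(frames, axis=0)).mean())
```

A near-static clip scores close to 1.0 on the smoothness proxy but near 0 on the dynamic-degree proxy, which is exactly the failure mode of near-static S2I+I2V pipelines that the Dynamic Degree column is meant to expose.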

CLIP-T (Text Alignment)

Measures how well the generated video aligns with the input text prompt using CLIP embeddings.

CLIP-I / DINO-I (Identity)

Measures subject fidelity using CLIP and DINO embeddings. Higher DINO-I indicates better preservation of fine-grained subject details.
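
As a rough sketch of how these similarity metrics are commonly computed: per-frame embeddings are compared against the prompt (CLIP-T) or the reference image (CLIP-I / DINO-I) via cosine similarity and averaged over frames. The checkpoints and preprocessing below are illustrative assumptions, not necessarily the exact setup used in our evaluation.

```python
# Minimal sketch of CLIP-T / CLIP-I / DINO-I scoring, assuming the common
# per-frame cosine-similarity recipe. Model checkpoints are illustrative.
import torch
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = AutoModel.from_pretrained("facebook/dino-vits16").eval()
dino_proc = AutoImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_scores(frames, reference_image, prompt):
    """frames: list of PIL frames; returns (CLIP-T, CLIP-I)."""
    inputs = clip_proc(text=[prompt], images=frames + [reference_image],
                       return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = torch.nn.functional.normalize(out.image_embeds, dim=-1)
    txt = torch.nn.functional.normalize(out.text_embeds, dim=-1)
    frame_emb, ref_emb = img[:-1], img[-1:]
    clip_t = (frame_emb @ txt.T).mean().item()      # text alignment
    clip_i = (frame_emb @ ref_emb.T).mean().item()  # subject fidelity
    return clip_t, clip_i

@torch.no_grad()
def dino_score(frames, reference_image):
    """DINO-I: cosine similarity of DINO CLS features, averaged over frames."""
    inputs = dino_proc(images=frames + [reference_image], return_tensors="pt")
    feats = dino(**inputs).last_hidden_state[:, 0]  # CLS token per image
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return (feats[:-1] @ feats[-1:].T).mean().item()
```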


VBench Evaluation

We compare against two categories of baselines: (1) Custom T2V methods that directly train text-to-video models on subject-video pairs (Phantom, VACE, VideoBooth), and (2) Custom T2I + I2V methods that combine subject-driven image generation with image-to-video models (OmniControl+I2V, BLIP+I2V, IP-Adapter+I2V).

Note that, for a fair comparison, we use the publicly available Wan-2.1 1.3B-based versions of Phantom and VACE.

| Method | Dataset | A100 h | Motion Smooth.↑ | Dynamic Deg.↑ | CLIP-T↑ | CLIP-I↑ | DINO-I↑ |
|---|---|---|---|---|---|---|---|
| Custom T2V Methods | | | | | | | |
| Phantom | 1M pairs | ~10K | 98.93 | 54.90 | 33.51 | 72.49 | 51.94 |
| VACE | 53M videos | ~70K | 98.68 | 40.00 | 33.60 | 73.35 | 52.68 |
| VideoBooth | 30K videos | - | 96.95 | 51.67 | 29.59 | 66.06 | 34.54 |
| Custom T2I + I2V Methods | | | | | | | |
| OmniControl + I2V | - | - | 98.21 | 51.67 | 31.89 | 72.58 | 54.16 |
| BLIP + I2V | - | - | 97.53 | 49.17 | 28.19 | 79.29 | 56.58 |
| IP-Adapter + I2V | - | - | 97.21 | 55.83 | 26.97 | 73.86 | 45.18 |
| Ours | 200K S2I + 4K videos | 288 | 98.45 | 69.64 | 32.69 | 77.14 | 62.88 |

Key Findings

  • Highest Dynamic Degree (69.64): Our method generates videos with significantly more motion than all baselines, avoiding the "near-static" problem common in S2I+I2V approaches.
  • Best Identity Preservation (DINO-I: 62.88): Our dual-task learning effectively preserves subject identity, outperforming even methods trained on orders of magnitude more data.
  • Competitive Motion Smoothness (98.45): Despite emphasizing dynamics, our videos maintain temporal coherence comparable to state-of-the-art methods.
  • Balanced Trade-off: Among T2I+I2V pipelines, BLIP+I2V achieves the highest CLIP-I (79.29) but with weaker dynamics; our method offers a more balanced trade-off between motion and subject fidelity.

Open-S2V Benchmark Evaluation

We evaluate on the Open-S2V Benchmark (single-domain track) without any per-subject tuning or domain-specific adaptation, using the same checkpoint trained on 200K S2I pairs and 4K unpaired proxy videos. This benchmark provides a comprehensive evaluation including aesthetics, motion quality, face similarity, and overall generation quality.

| Method | Training Cost | Total↑ | Aesthetics↑ | Motion↑ | FaceSim↑ | GmeScore↑ | NexusScore↑ | NaturalScore↑ |
|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | | | |
| Vidu2.0 | - | 48.67% | 34.78% | 24.40% | 36.20% | 65.56% | 45.20% | 72.60% |
| Pika2.1 | - | 48.93% | 38.64% | 31.90% | 32.94% | 62.19% | 47.34% | 70.60% |
| Kling1.6 | - | 53.12% | 35.63% | 36.40% | 39.26% | 61.99% | 48.24% | 81.40% |
| VACE Series | | | | | | | | |
| VACE-P1.3B | ≈70K h | 44.28% | 42.58% | 18.00% | 18.02% | 65.93% | 36.26% | 76.00% |
| VACE-1.3B | ≈70K h | 47.33% | 41.81% | 33.78% | 22.38% | 65.35% | 38.52% | 76.00% |
| VACE-14B | ≈210K h | 58.00% | 41.30% | 35.54% | 64.65% | 58.55% | 51.33% | 77.33% |
| Phantom Series | | | | | | | | |
| Phantom-1.3B | ≈10K h | 49.95% | 42.98% | 19.30% | 44.03% | 65.61% | 37.78% | 76.00% |
| Phantom-14B | ≈30K h | 53.17% | 47.46% | 41.55% | 51.82% | 70.07% | 35.35% | 69.35% |
| Other Methods | | | | | | | | |
| SkyReels-A2-P14B | - | 51.64% | 33.83% | 21.60% | 54.42% | 61.93% | 48.63% | 70.60% |
| HunyuanCustom | - | 51.64% | 34.08% | 26.83% | 55.93% | 54.31% | 50.75% | 68.66% |
| Ours | 288 h | 50.05% | 45.40% | 19.38% | 18.05% | 70.53% | 41.23% | 68.52% |

Efficiency Highlight: Our 288 A100 hours amount to roughly 35 to 104x less compute than Phantom-1.3B/14B (~10K to 30K hours) and 240 to 730x less than VACE-1.3B/14B (~70K to 210K hours), yet yield a Total Score comparable to several baselines.

Analysis

  • Total Score (50.05%): Comparable to Phantom-1.3B (49.95%) and within the range of recent closed-source models.
  • Highest GmeScore (70.53%): Indicates strong general video generation quality.
  • Strong Aesthetics (45.40%): Second only to Phantom-14B (47.46%), demonstrating that S2I-driven identity injection preserves visual appeal.
  • Competitive NaturalScore (68.52%): Indicates that S2I-driven identity injection coupled with sparse video replay preserves visual realism.
  • Motion Score (19.38%): Lags behind larger video-centric models, consistent with our compute-efficient training that predominantly optimizes on images.
  • FaceSim (18.05%): Lower than face-specialized models, in line with our limitation that Subject-200K contains few human faces; it is, however, comparable to VACE-P1.3B (18.02%).

Training Cost Comparison

A key advantage of our approach is the dramatic reduction in training cost while maintaining competitive performance. The table below summarizes the data and compute requirements across different S2V methods.

| Method | Dataset Size | Base Model | A100 Hours | Domain |
|---|---|---|---|---|
| Per-subject Tuned Methods (require fine-tuning for each subject) | | | | |
| CustomCrafter | 200 reg. images / subj. | VideoCrafter2 (1.4B) | ~200 / subj. | Object |
| Still-Moving | few images + 40 videos | Lumiere (1.2B) | - | Face/Object |
| Tuning-free Methods (zero-shot generalization) | | | | |
| VACE | ~53M videos | LTX & Wan (14B) | 70K to 210K | General |
| Phantom | 1M subject-video pairs | Wan (1.3 to 14B) | 10K to 30K | Face/Object |
| Consis-ID | 130K clips | CogVideoX (5B) | - | Face |
| Ours | 200K S2I + 4K videos | CogVideoX (5B) | 288 | Object |

Our method achieves 35 to 730x lower training cost than tuning-free baselines while requiring no per-subject fine-tuning at inference. This efficiency comes from our dual-task formulation that leverages abundant S2I pairs for identity injection while using only a small set of unpaired proxy videos (~1% of Pexels-400K) for temporal awareness preservation.
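
The headline cost ratios in this section follow directly from the reported A100-hour budgets; a quick arithmetic check, with the numbers taken from the tables above:

```python
# Sanity check of the quoted training-cost ratios; A100-hour budgets are
# the approximate figures reported in the tables above.
ours = 288
baselines = {
    "Phantom-1.3B": 10_000,
    "Phantom-14B": 30_000,
    "VACE-1.3B": 70_000,
    "VACE-14B": 210_000,
}
for name, hours in baselines.items():
    print(f"{name}: {hours / ours:.0f}x the A100 hours of our training run")
# -> Phantom-1.3B: 35x, Phantom-14B: 104x, VACE-1.3B: 243x, VACE-14B: 729x
```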

