A Comprehensive Ecosystem for
Open-Domain Customized Video Generation

ICASSP 2026
Jingxu Zhang1,2, Yuqian Hong1, Daneul Kim3, Kai Qiu2, Qi Dai2, Jianmin Bao2, Yifan Yang2, Xiaoyan Sun1, Chong Luo2
1University of Science and Technology of China    2Microsoft Research Asia    3Seoul National University
[Figure: CustomDiT identity-preserving video generation examples]
Given a single reference image of any subject (leftmost column), CustomDiT generates temporally coherent video frames that faithfully preserve the subject's identity while following diverse text prompts.

Abstract

Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated ⟨identity, text, video⟩ triplets across 8,000+ categories.

Leveraging this dataset, we propose CustomDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art methods on both existing benchmarks and our new benchmark.

To overcome the limited coverage of existing benchmarks (e.g., DreamBooth covers only 100 classes), we construct OpenCustom, a comprehensive evaluation benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. We open-source the entire ecosystem—including dataset, pipeline, benchmark, and implementations—to support further research.

Generation Results

Given a single reference image, CustomDiT generates identity-preserving videos following the text prompt in a zero-shot manner.

Reference: Samoyed
"A Samoyed walks through a snowy forest, blending with the snow-covered ground"
Reference: cat
"a red cat"
Reference: sheep
"A sheep grazes in a lush meadow as the camera pans across the rolling hills"
Reference: bell cote
"A bell cote atop a village church, surrounded by green fields and wildflowers"
Reference: triumphal arch
"A triumphal arch stands in a bustling city square as the camera circles around"
Reference: donut
"A donut sits on a kitchen counter as the camera slowly pans around"

PexelsCustom-1M Dataset

The first million-scale, publicly available dataset for open-domain identity-preserving video generation. Each sample is an ⟨identity, text, video⟩ triplet with subject-centric captions and precise segmentation masks.

1M Annotated Triplets
8,373 Identity Categories
239K Source Videos
[Figure: Data curation pipeline]
Data curation pipeline of PexelsCustom-1M. The pipeline enriches data during pre-processing (Florence-2, GroundingDINO, SAM2) while filtering and generating subject-centric captions via GPT-4o during post-processing.
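The two-phase structure described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the `Triplet` field names and the stage callables (`detect`, `segment`, `keep`, `caption`) are hypothetical stand-ins for the roles played by Florence-2/GroundingDINO, SAM2, and GPT-4o, and no real model API is called.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Triplet:
    """One <identity, text, video> sample (field names are illustrative)."""
    identity_category: str  # e.g. "samoyed"
    caption: str            # subject-centric caption
    video_path: str         # source clip
    mask_path: str          # segmentation mask of the subject


def curate(
    video_paths: Iterable[str],
    detect: Callable[[str], Optional[str]],  # hypothetical: tag/ground the main subject
    segment: Callable[[str, str], str],      # hypothetical: produce a subject mask
    keep: Callable[[str, str, str], bool],   # hypothetical: post-processing filter
    caption: Callable[[str, str], str],      # hypothetical: subject-centric captioner
) -> List[Triplet]:
    triplets = []
    for path in video_paths:
        # Pre-processing: enrich the raw video with a category and a mask.
        category = detect(path)
        if category is None:
            continue  # no usable subject found
        mask = segment(path, category)
        # Post-processing: filter first, then caption the surviving samples.
        if not keep(path, category, mask):
            continue
        triplets.append(Triplet(category, caption(path, category), path, mask))
    return triplets
```

Passing the stages in as callables keeps the sketch independent of any particular model; each one could be swapped for a real detector, segmenter, or captioner without changing the control flow.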

CustomDiT

CustomDiT conditions text-to-video generation on identity-aware reference images via bias-injected RoPE embeddings, while LoRA layers enable efficient adaptation with only 8% additional learnable parameters. Training follows a two-stage curriculum: Stage 1 without data augmentation (WoDA) for identity preservation, and Stage 2 with data augmentation (WtDA) for generalization.
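To make the parameter-efficiency argument concrete, the sketch below shows a generic LoRA update and how the added parameter count relates to the frozen weights. It assumes the standard LoRA formulation (a frozen weight W plus a low-rank product B·A); the layer shapes and rank are hypothetical and are not CustomDiT's actual configuration.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x), with W frozen and only A, B trainable."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(layers, rank):
    """Added LoRA parameters as a fraction of the frozen weights they adapt.

    `layers` is a list of (d_in, d_out) shapes of the adapted linear layers.
    """
    base = sum(d_in * d_out for d_in, d_out in layers)
    extra = sum(lora_extra_params(d_in, d_out, rank) for d_in, d_out in layers)
    return extra / base


# Hypothetical example: four 1024x1024 projections adapted at rank 32.
print(lora_fraction([(1024, 1024)] * 4, rank=32))  # 0.0625
```

The fraction grows linearly with the rank, so the rank is the main knob for trading adaptation capacity against the number of trainable parameters.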

[Figure: CustomDiT architecture]
Overview of CustomDiT. Training pipeline (above) shows how reference image latents are injected via concatenation with shifted positional embeddings. Inference (below) demonstrates zero-shot generation given a reference image and text prompt.
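The idea of concatenating reference tokens under shifted positional embeddings can be illustrated with a one-dimensional rotary embedding. This is a simplified sketch, not the authors' implementation: video DiTs typically use multi-axis RoPE over time, height, and width, and the `bias` value and the helper `positions_with_reference` are assumptions made purely for illustration.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply 1D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair is rotated at its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def positions_with_reference(num_video_tokens, num_ref_tokens, bias):
    """Positions for [video tokens ; reference tokens] after concatenation.

    Reference tokens are offset by `bias` (hypothetical) so that they do not
    share positions with the video tokens they are appended to.
    """
    video_pos = list(range(num_video_tokens))
    ref_pos = [bias + j for j in range(num_ref_tokens)]
    return video_pos + ref_pos


print(positions_with_reference(4, 2, bias=100))  # [0, 1, 2, 3, 100, 101]
```

Because rotary attention scores depend only on the difference between two positions, the offset controls how far the reference tokens appear from the video tokens without changing the relationships among the video tokens themselves.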

Qualitative Comparison

Side-by-side comparison with baseline methods. CustomDiT (rightmost, highlighted) best preserves subject identity while generating natural, dynamic videos.


Benchmark Results

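Quantitative comparison on DreamBooth-Custom and OpenCustom benchmarks. CustomDiT achieves the highest DINO-I and dynamic degree (D.D.) on both benchmarks and the highest CLIP-I on DreamBooth (76.93). Its CLIP-I on OpenCustom (75.32) is within one point of the best baseline (BLIP-Diffusion, 76.05), and its text alignment (CLIP-T) remains competitive.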

Method            Benchmark   M.S. ↑  D.D. ↑  CLIP-T ↑  CLIP-I ↑  DINO-I ↑
OminiControl      DreamBooth  98.91   24.00   30.45     68.64     43.69
OminiControl      OpenCustom  98.69   38.64   31.34     64.57     34.69
MS-Diffusion      DreamBooth  99.25    9.00   30.08     76.21     62.67
MS-Diffusion      OpenCustom  98.90   20.36   31.51     75.23     59.44
BLIP-Diffusion    DreamBooth  98.93    4.00   27.64     76.36     54.21
BLIP-Diffusion    OpenCustom  98.51   20.93   28.79     76.05     54.60
IP-Adapter        DreamBooth  98.93    7.00   28.96     76.52     54.57
IP-Adapter        OpenCustom  98.62   29.50   30.86     74.21     49.00
VideoBooth        DreamBooth  96.97   50.00   27.25     61.63     31.38
VideoBooth        OpenCustom  96.61   57.86   28.20     67.69     41.48
ID-Animator       DreamBooth  99.30    5.00   30.94     67.29     34.62
ID-Animator       OpenCustom  99.14    8.79   31.50     66.81     34.38
CustomDiT (Ours)  DreamBooth  97.66   61.00   29.17     76.93     66.59
CustomDiT (Ours)  OpenCustom  97.42   70.29   30.96     75.32     65.80
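For readers unfamiliar with the identity metrics, the sketch below shows how a CLIP-I or DINO-I style score is commonly computed. It assumes the usual definition (mean cosine similarity between the reference image embedding and each generated frame embedding, reported on a 0 to 100 scale); the benchmark's exact protocol, such as frame sampling or subject cropping, may differ, and the image encoders that produce the embeddings are not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_score(ref_embedding, frame_embeddings):
    """Mean reference-to-frame cosine similarity, scaled to 0-100.

    Feeding CLIP image embeddings gives a CLIP-I style score; feeding DINO
    embeddings gives a DINO-I style score (assumed convention).
    """
    sims = [cosine(ref_embedding, f) for f in frame_embeddings]
    return 100.0 * sum(sims) / len(sims)


# Toy embeddings: one frame matches the reference, one is orthogonal to it.
print(identity_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 50.0
```

A CLIP-T style score follows the same pattern with a text embedding of the prompt in place of the reference image embedding.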

BibTeX

@inproceedings{zhang2026comprehensive,
  title={A Comprehensive Ecosystem for Open-Domain Customized Video Generation},
  author={Zhang, Jingxu and Hong, Yuqian and Kim, Daneul and Qiu, Kai and Dai, Qi and Bao, Jianmin and Yang, Yifan and Sun, Xiaoyan and Luo, Chong},
  booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={9167--9171},
  year={2026},
  organization={IEEE}
}