PexelsCustom-1M: A Comprehensive Ecosystem for Open-Domain Customized Video Generation

CustomDiT identity-preserving video generation examples

Given a single reference image of any subject (leftmost column), CustomDiT generates temporally coherent video frames that faithfully preserve the subject's identity while following diverse text prompts.

◆Abstract

Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated ⟨identity, text, video⟩ triplets across 8,000+ categories.

Leveraging this, we propose CustomDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art on both existing and our new benchmarks.

To overcome the limited coverage of existing benchmarks (e.g., DreamBooth covers only 100 classes), we construct OpenCustom, a comprehensive evaluation benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. We open-source the entire ecosystem—including dataset, pipeline, benchmark, and implementations—to support further research.

◆Generation Results

Given a single reference image, CustomDiT generates identity-preserving videos following the text prompt in a zero-shot manner.

Reference

"A Samoyed walks through a snowy forest, blending with the snow-covered ground"

Reference

"a red cat"

Reference

"A sheep grazes in a lush meadow as the camera pans across the rolling hills"

Reference

"A bell cote atop a village church, surrounded by green fields and wildflowers"

Reference

"A triumphal arch stands in a bustling city square as the camera circles around"

Reference

"A donut sits on a kitchen counter as the camera slowly pans around"

◆PexelsCustom-1M Dataset

The first million-scale, publicly available dataset for open-domain identity-preserving video generation. Each sample is an ⟨identity, text, video⟩ triplet with subject-centric captions and precise segmentation masks.

Annotated Triplets

8,373

Identity Categories

239K

Source Videos

Data curation pipeline of PexelsCustom-1M. The pipeline enriches data during pre-processing (Florence-2, GroundingDINO, SAM2) while filtering and generating subject-centric captions via GPT-4o during post-processing.

◆CustomDiT

CustomDiT conditions text-to-video generation on identity-aware reference images via bias-injected RoPE embeddings, while LoRA layers enable efficient adaptation with only 8% additional learnable parameters. Training follows a two-stage curriculum: Stage 1 without data augmentation (WoDA) for identity preservation, and Stage 2 with data augmentation (WtDA) for generalization.

Overview of CustomDiT. Training pipeline (above) shows how reference image latents are injected via concatenation with shifted positional embeddings. Inference (below) demonstrates zero-shot generation given a reference image and text prompt.

◆Qualitative Comparison

Side-by-side comparison with baseline methods. CustomDiT (rightmost, highlighted) best preserves subject identity while generating natural, dynamic videos.

Prompt: "A lamp in a log cabin, a cozy fireplace crackling nearby."

Reference

VideoBooth

BLIP-Diff.

IP-Adapter

MS-Diff.

OminiCtrl

Ours

Prompt: "a dog on top of a purple rug in a forest"

Reference

VideoBooth

BLIP-Diff.

IP-Adapter

MS-Diff.

OminiCtrl

Ours

Prompt: "Golden retriever walks on sandy beach with waves and seagulls."

Reference

VideoBooth

BLIP-Diff.

IP-Adapter

MS-Diff.

OminiCtrl

Ours

Prompt: "Pokemon plushie made of crochet."

Reference

VideoBooth

BLIP-Diff.

IP-Adapter

MS-Diff.

OminiCtrl

Ours

◆Benchmark Results

Quantitative comparison on DreamBooth-Custom and OpenCustom benchmarks. CustomDiT achieves the best identity preservation (CLIP-I, DINO-I) and dynamic degree while maintaining competitive text alignment.

Method	Benchmark	M.S. ↑	D.D. ↑	CLIP-T ↑	CLIP-I ↑	DINO-I ↑
OminiControl	DreamBooth	98.91	24.00	30.45	68.64	43.69
OminiControl	OpenCustom	98.69	38.64	31.34	64.57	34.69
MS-Diffusion	DreamBooth	99.25	9.00	30.08	76.21	62.67
MS-Diffusion	OpenCustom	98.90	20.36	31.51	75.23	59.44
BLIP-Diffusion	DreamBooth	98.93	4.00	27.64	76.36	54.21
BLIP-Diffusion	OpenCustom	98.51	20.93	28.79	76.05	54.60
IP-Adapter	DreamBooth	98.93	7.00	28.96	76.52	54.57
IP-Adapter	OpenCustom	98.62	29.50	30.86	74.21	49.00
VideoBooth	DreamBooth	96.97	50.00	27.25	61.63	31.38
VideoBooth	OpenCustom	96.61	57.86	28.20	67.69	41.48
ID-Animator	DreamBooth	99.30	5.00	30.94	67.29	34.62
ID-Animator	OpenCustom	99.14	8.79	31.50	66.81	34.38
CustomDiT (Ours)	DreamBooth	97.66	61.00	29.17	76.93	66.59
CustomDiT (Ours)	OpenCustom	97.42	70.29	30.96	75.32	65.80

◆BibTeX

@inproceedings{zhang2026comprehensive, title={A Comprehensive Ecosystem for Open-Domain Customized Video Generation}, author={Zhang, Jingxu and Hong, Yuqian and Kim, Daneul and Qiu, Kai and Dai, Qi and Bao, Jianmin and Yang, Yifan and Sun, Xiaoyan and Luo, Chong}, booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={9167--9171}, year={2026}, organization={IEEE} }

A Comprehensive Ecosystem forOpen-Domain Customized Video Generation

◆Abstract

◆Generation Results

◆PexelsCustom-1M Dataset

◆CustomDiT

◆Qualitative Comparison

◆Benchmark Results

◆BibTeX

A Comprehensive Ecosystem for
Open-Domain Customized Video Generation