HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment

State Key Lab of CAD&CG, Zhejiang University
*Corresponding Author

Abstract

With the rapid development of generative artificial intelligence, significant advances have been made in diffusion-based Text-to-Image (T2I) and Text-to-Video (T2V) technologies. Recent studies have introduced Direct Preference Optimization (DPO) to T2I tasks, greatly improving the alignment of generated images with human preferences. However, current T2V methods lack a complete pipeline and a dedicated loss function for aligning generated videos with human preferences via DPO. Moreover, the scarcity of paired video preference data prevents effective model training. Additionally, the SD v1.4 weights lack the capability to maintain spatiotemporal consistency during video generation, which may constrain the model's flexibility and lead to lower-quality outputs. In response, we propose three solutions: 1) We integrate the DPO fine-tuning strategy into T2V tasks. By deriving a carefully structured loss function, we use human feedback to align video generation with human preferences; we call this new method HuViDPO. 2) We construct a small-scale Human Preference Video Pair Dataset to meet the core requirement of the DPO fine-tuning strategy, addressing the current scarcity of pairwise video preference datasets. 3) We propose a DPO-based fine-tuning strategy that adapts SD v1.4 for short video generation, achieving clear visual improvements over baselines; we further verify its strong performance on T2V customization tasks.

🔥 Learn T2V Generation with Only 8~16 Videos 🔥

Training pipeline of our HuViDPO. The training process is divided into two stages: (a) training the attention blocks and temporal-spatial layers with basic training settings to improve spatiotemporal consistency; (b) fine-tuning the model, with LoRA added and all other layers frozen, on the small-scale human preference dataset using the DPO strategy to enhance its alignment with human preferences. In stage (b), loss_w and loss_l denote the losses computed by feeding the winning and losing videos into the fine-tuned model, while loss_w^ref and loss_l^ref are the losses obtained by feeding the same videos into the frozen reference model.
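The stage (b) objective can be sketched from these four loss terms. The snippet below is a minimal illustration, assuming a standard Diffusion-DPO-style formulation (the exact loss and the `beta` temperature are assumptions, not the paper's released implementation): the margin compares how much more the fine-tuned model improves over the reference on the winning video than on the losing one, and a logistic loss pushes that margin in the winner's favor.

```python
import math

def dpo_style_video_loss(loss_w, loss_l, loss_w_ref, loss_l_ref, beta=0.1):
    """Sketch of a DPO-style preference loss from per-video denoising losses.

    loss_w / loss_l: fine-tuned model's losses on the winning / losing video.
    loss_w_ref / loss_l_ref: frozen reference model's losses on the same videos.
    beta: temperature controlling deviation from the reference (assumed value).
    """
    # Implicit preference margin: negative when the fine-tuned model improves
    # more on the winning video than on the losing one, relative to the
    # reference model.
    margin = (loss_w - loss_w_ref) - (loss_l - loss_l_ref)
    # -log(sigmoid(-beta * margin)): minimized as the margin becomes negative,
    # i.e. as the model learns to favor the human-preferred (winning) video.
    return -math.log(1.0 / (1.0 + math.exp(beta * margin)))
```

In practice the four terms would be per-sample diffusion denoising losses and the whole expression would be computed on tensors with gradients flowing only through `loss_w` and `loss_l`; this scalar version just shows the shape of the objective.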

🎬   Text-to-Video Results   🎬

fireworks exploding over a calm ocean at night. fireworks exploding over a futuristic city skyline with neon lights. fireworks lighting up a stormy sky over the ocean.
a musician playing the guitar beside a crackling campfire on a cool night. an elven bard playing a magical guitar in an enchanted forest. a warrior playing a glowing guitar that summons lightning in a stormy sky.
a LEGO helicopter flies in the sky. a white helicopter flies at night. a helicopter flies in the rainy day.
a majestic waterfall cascading into a crystal-clear pool surrounded by lush greenery. a peaceful waterfall flowing through a serene forest, with cherry blossoms gently floating in the air. a waterfall flowing in slow motion, droplets shimmering in the sunlight, with a sense of serene beauty.

🎞   Qualitative comparison with LVDM, AnimateDiff and LAMP  🎞

LVDM AnimateDiff LAMP Ours
birds fly in the pink sky.
fireworks exploding in the clouds over a rainbow bridge.
a person playing the guitar under the stars in a quiet desert.
rain falling in an enchanted forest, glowing magical creatures dancing in the mist.

🎞   Qualitative comparison with SD v1.4, SD v1.5 and SD-XL  🎞

SD v1.4 SD v1.5 SD-XL Ours
birds flying over a misty forest at dawn.
a horse runs in the snow.
a waterfall illuminated by the golden light of sunset, mist rising into the air.
a handsome man.

🎬   Pixel-art-style   🎬


🎬   Claymation-style   🎬


🎬   Realistic-style   🎬


🎬   Cartoon-style   🎬


BibTeX

BibTeX code here