Tea-Adapter: Teacher Adapter for Efficient Conditional Generation

Yinhan Zhang1,2*, Yue Ma3*, Fangqiu Yi2, Chenyang Qi3, Chi Zhang2, Zeyu Wang1,3
1 The Hong Kong University of Science and Technology (Guangzhou), 2 Institute of Artificial Intelligence (TeleAI), China Telecom, 3 The Hong Kong University of Science and Technology

Abstract

We propose Tea-Adapter, a plug-and-play adapter that efficiently transfers conditional-control knowledge from a smaller teacher model into a larger student video diffusion model. Existing controllable video DiT methods face two critical challenges: full fine-tuning of billion-parameter models is prohibitively expensive, while cascaded ControlNets add substantial parameter overhead and offer limited flexibility for novel multi-condition compositions. To overcome these issues, Tea-Adapter introduces a novel reverse distillation method that enables a large video diffusion model to inherit precise control capabilities from a smaller, efficiently tuned teacher diffusion model, eliminating the need for full fine-tuning. Moreover, recognizing the intrinsic relationships among different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer, which dynamically routes diverse conditional inputs within a unified architecture, supporting both single-condition control and multi-condition combinations at no additional training cost. To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module that ensures efficient and temporally consistent feature propagation across video frames. Experiments demonstrate that Tea-Adapter enables high-fidelity, multi-condition video synthesis, making advanced, controllable video generation feasible on low-resource hardware.
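The abstract describes reverse distillation only at a high level. As an illustration, the following is a minimal PyTorch sketch of what one such training step could look like, assuming the teacher is a small condition-tuned diffusion model, the student is a large frozen video diffusion model, and only a lightweight adapter is trained. All module and function names here (FeatureAdapter, reverse_distillation_loss) are hypothetical placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Hypothetical adapter: projects frozen student features toward the
    teacher's condition-aware feature space."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, h_student: torch.Tensor) -> torch.Tensor:
        return self.proj(h_student)

def reverse_distillation_loss(adapter, h_student, h_teacher):
    # Align adapted student features with the teacher's control features;
    # the teacher is detached, so only the adapter receives gradients.
    return F.mse_loss(adapter(h_student), h_teacher.detach())

# Usage: per training step, extract intermediate activations from both models
# on the same noisy latents and conditions, then update only the adapter.
adapter = FeatureAdapter(student_dim=1024, teacher_dim=512)
h_s = torch.randn(2, 77, 1024)  # stand-in for student block activations
h_t = torch.randn(2, 77, 512)   # stand-in for teacher block activations
loss = reverse_distillation_loss(adapter, h_s, h_t)
loss.backward()
```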

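Similarly, the MCE layer is only named, not specified. The sketch below shows one plausible reading, assuming each expert specializes in one condition type (e.g., depth, pose, edges) and a learned gate mixes expert outputs per token, so composed conditions blend several experts without retraining. The design is an assumption for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MCELayer(nn.Module):
    """Hypothetical Mixture of Condition Experts: gated routing over
    per-condition expert MLPs within a single unified layer."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        # cond_tokens: (batch, tokens, dim) embedding of one or more conditions.
        weights = torch.softmax(self.gate(cond_tokens), dim=-1)  # (B, T, E)
        expert_out = torch.stack(
            [expert(cond_tokens) for expert in self.experts], dim=-1
        )  # (B, T, D, E)
        # Weighted mixture: a single condition activates mostly one expert,
        # while multi-condition inputs blend several experts.
        return torch.einsum("btde,bte->btd", expert_out, weights)

mce = MCELayer(dim=256, num_experts=4)
fused = mce(torch.randn(2, 77, 256))  # fused condition features
```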