Tea-Adapter: Teacher Adapter for Efficient Conditional Generation
Abstract
We propose Tea-Adapter, a plug-and-play adapter that efficiently transfers conditional knowledge from a smaller teacher model into a larger student video diffusion model. Existing controllable video DiT methods face two critical challenges: full fine-tuning of billion-parameter models is prohibitively expensive, while cascaded ControlNets introduce substantial parameter overhead and offer limited flexibility for novel multi-condition compositions. To overcome these issues, Tea-Adapter introduces a reverse distillation method that enables large video diffusion models to inherit precise control capabilities from smaller, efficiently tuned teacher diffusion models, eliminating the need for full fine-tuning. Moreover, recognizing the intrinsic relationships between different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer, which dynamically routes diverse conditional inputs within a unified architecture and supports both single-condition control and multi-condition combinations without additional training cost. To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module that ensures efficient and temporally consistent feature propagation across video frames. Experiments demonstrate that Tea-Adapter enables high-fidelity, multi-condition video synthesis, making advanced, controllable video generation feasible on low-resource hardware.
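To make the MCE idea concrete, the following is a minimal NumPy sketch of soft mixture-of-experts routing over condition features. It is not the paper's implementation: the expert count, dimensions, and the choice of linear experts with softmax gating are illustrative assumptions; the names `mce_layer`, `expert_weights`, and `gate_weight` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16           # feature dimension (illustrative)
NUM_EXPERTS = 3  # e.g. depth, pose, sketch condition experts (hypothetical)

# Each "expert" is a small linear projection standing in for a condition branch.
expert_weights = [rng.standard_normal((D, D)) * 0.1 for _ in range(NUM_EXPERTS)]
# Gating network: maps a condition feature to logits over the experts.
gate_weight = rng.standard_normal((D, NUM_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mce_layer(cond_feats):
    """Route a batch of condition features through all experts and
    blend their outputs with softmax gating weights (soft routing)."""
    gates = softmax(cond_feats @ gate_weight)                          # (B, E)
    outs = np.stack([cond_feats @ W for W in expert_weights], axis=1)  # (B, E, D)
    return (gates[..., None] * outs).sum(axis=1)                       # (B, D)

# The same layer handles a single condition or several conditions batched
# together, which is what allows multi-condition control without retraining.
single = mce_layer(rng.standard_normal((1, D)))
multi = mce_layer(rng.standard_normal((2, D)))
```

Because the gate produces a distribution over experts per input, unseen condition combinations are handled by blending the already-trained expert branches rather than by training a new cascaded module.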
