The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, correlates strongly with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, evaluating with TED-6K is about \textbf{750$\times$ faster} than training a diffusion model from scratch. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm: an initial fine-tuning stage on a Multimodal Large Language Model to improve visual representation, followed by a layer-wise weighting method that extracts more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also yields demonstrable gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at https://anonymous.4open.science/r/GRAN-TED-4FCC/.
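The layer-wise weighting mentioned above can be illustrated with a minimal sketch: a softmax-weighted combination of the encoder's per-layer hidden states. Note this is a hypothetical parameterization for illustration only; the abstract does not specify the exact scheme, and the function and variable names here are assumptions.

```python
import numpy as np

def layerwise_weighting(hidden_states, logits):
    """Softmax-weighted sum of per-layer encoder hidden states.

    hidden_states: (num_layers, seq_len, dim) array of layer outputs.
    logits: (num_layers,) learnable scores (hypothetical; in training these
    would be optimized jointly with the diffusion objective).
    """
    w = np.exp(logits - logits.max())
    w = w / w.sum()                                # softmax over layers
    return np.tensordot(w, hidden_states, axes=1)  # (seq_len, dim)

# Usage: with uniform logits the result reduces to the mean over layers.
h = np.random.randn(12, 8, 64)
out = layerwise_weighting(h, np.zeros(12))
print(out.shape)  # (8, 64)
```

With all-zero logits the weights are uniform, so the output equals the per-layer mean; training would learn to up-weight layers carrying the most useful semantics for generation.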