We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models with fewer than 500M parameters, both proprietary and open, and delivers performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs, making EmbeddingGemma particularly well-suited for low-latency, high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices, and we release EmbeddingGemma to the community to promote further research.
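To make the spread-out regularizer concrete: it penalizes embeddings of distinct inputs for clustering together, pushing them toward the statistics of random unit vectors (mean pairwise similarity near zero, second moment near $1/d$ for dimension $d$). Below is a minimal sketch in PyTorch, assuming in-batch non-matching pairs; the function name, batch-level formulation, and loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def spread_out_regularizer(embeddings: torch.Tensor) -> torch.Tensor:
    """Spread-out penalty over a batch of embeddings (illustrative sketch).

    Encourages embeddings of distinct items to behave like random unit
    vectors: mean pairwise dot product near 0, second moment near 1/d.
    `embeddings` has shape (batch, dim); rows are assumed distinct items.
    """
    z = F.normalize(embeddings, dim=-1)      # project onto the unit sphere
    sims = z @ z.T                           # pairwise cosine similarities
    b, d = z.shape
    # Keep only non-matching (off-diagonal) pairs.
    off_diag = sims[~torch.eye(b, dtype=torch.bool, device=z.device)]
    m1 = off_diag.mean()                     # first moment of non-matching sims
    m2 = (off_diag ** 2).mean()              # second moment
    return m1 ** 2 + torch.clamp(m2 - 1.0 / d, min=0.0)

# Hypothetical usage: add the penalty to the main embedding objective,
# with `lam` a tunable weight (value here is a placeholder).
# loss = contrastive_loss + lam * spread_out_regularizer(batch_embeddings)
```

In training, this term would be added to the primary embedding objective so the model spreads representations across the sphere rather than collapsing them, which is what the abstract credits for improved robustness and expressiveness.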