We present C2LLM (Contrastive Code Large Language Models), a family of code embedding models in 0.5B and 7B sizes. Built on Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module to generate sequence embeddings from token embeddings, which 1) utilizes the LLM's causal representations acquired during pretraining, 2) aggregates information from all tokens in the sequence, breaking the information bottleneck of EOS-based sequence embeddings, and 3) supports flexible adaptation of the embedding dimension, serving as an alternative to Matryoshka Representation Learning (MRL). Trained on three million publicly available examples, C2LLM models set new records on MTEB-Code among models of similar size, with C2LLM-7B ranking 1st on the overall leaderboard.
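As a rough illustration of the pooling scheme described above, the following is a minimal PyTorch sketch of a PMA-style pooler: a learnable query attends over all token states produced by the causal LM, so the pooled embedding is not restricted to the EOS position, and the query dimension sets the output embedding size. The class and parameter names (`PMAPooling`, `embed_dim`, `num_heads`) are hypothetical and this is not the authors' exact implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class PMAPooling(nn.Module):
    """Sketch of Pooling by Multihead Attention (PMA).

    A learnable seed query cross-attends over every token embedding,
    aggregating information from the whole sequence rather than a
    single EOS token.
    """

    def __init__(self, hidden_dim: int, embed_dim: int, num_heads: int = 8):
        super().__init__()
        # Learnable seed query; its size fixes the output embedding
        # dimension independently of the LLM hidden width.
        self.seed = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.kv_proj = nn.Linear(hidden_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(
        self,
        token_states: torch.Tensor,              # (batch, seq_len, hidden_dim)
        padding_mask: Optional[torch.Tensor] = None,  # (batch, seq_len), True = padding
    ) -> torch.Tensor:
        kv = self.kv_proj(token_states)
        query = self.seed.expand(token_states.size(0), -1, -1)
        pooled, _ = self.attn(query, kv, kv, key_padding_mask=padding_mask)
        return pooled.squeeze(1)                 # (batch, embed_dim) sequence embedding
```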