Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations poses significant security challenges for real-world deployments. This paper presents a study of special-character attacks, including Unicode, homoglyph, structural, and textual-encoding attacks, aimed at bypassing safety mechanisms. We evaluate seven prominent open-source models, ranging from 3.8B to 32B parameters, on more than 4,000 attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
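To make the attack family concrete, the sketch below illustrates one of the categories named above: a homoglyph-style character substitution, in which selected Latin letters in a prompt are swapped for visually similar Cyrillic codepoints. This is a minimal illustrative example, not the paper's actual attack pipeline; the mapping table and prompt text are hypothetical and chosen only for demonstration.

```python
# Minimal sketch of a homoglyph substitution attack (illustrative only,
# not the paper's implementation). Selected Latin letters are replaced by
# visually similar Cyrillic codepoints, so the prompt looks unchanged to a
# human reader but produces different underlying codepoints (and tokens).
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}

def perturb(prompt: str) -> str:
    """Replace every mapped Latin character with its homoglyph counterpart."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in prompt)

if __name__ == "__main__":
    original = "please explain the process"   # hypothetical prompt
    perturbed = perturb(original)
    print(original)
    print(perturbed)                 # visually near-identical
    print(original == perturbed)     # False: the codepoints differ
```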