Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region's linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA languages, covering eight languages and 21,640 samples across three subsets: general, in-the-wild, and content generation. Experimental results on our benchmark show that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios and perform worse on SEA-language texts than on English texts.