Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and a complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications and to examine the models' capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://github.com/gufranSabri/deepseek-evals.
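The few-shot gains above come from prepending a small number of labeled examples to the prompt. As a minimal sketch of how such a three-shot prompt might be assembled (the Arabic texts, labels, and instruction below are illustrative placeholders, not taken from the benchmark datasets):

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble a few-shot prompt: instruction, then k labeled
    in-context examples, then the unlabeled query."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Text: {text}\nLabel: {label}")
    # The query ends with a bare "Label:" so the model completes it.
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)

# Hypothetical three-shot setup for Arabic sentiment classification.
examples = [
    ("الخدمة ممتازة", "positive"),
    ("المنتج سيء جدا", "negative"),
    ("التجربة كانت رائعة", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    query="لم يعجبني الفيلم",
    instruction="Classify the sentiment of each Arabic text as positive or negative.",
)
print(prompt)
```

The resulting string would be sent as a single user message; careful selection of the three examples (rather than random sampling) is what drives the reported uplift.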