Lexical simplification (LS) is the task of automatically replacing complex words with simpler ones, making texts more accessible to various target populations (e.g., individuals with low literacy, individuals with learning disabilities, and second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT was compiled following the ALEXSIS protocol for Spanish, which opens new avenues for cross-lingual models. ALEXSIS-PT is the first multi-candidate LS dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.
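To make the notion of a multi-candidate entry concrete, the following is a minimal sketch of how one complex word in context, together with its candidate substitutions, might be represented and scored. Everything in it is an illustrative assumption: the `LSInstance` class, the `hit_at_k` function, the example sentence, and the annotator suggestions are invented for this sketch and are not taken from ALEXSIS-PT, and the hit-at-k check is a generic stand-in, not necessarily one of the evaluation metrics used in this work.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LSInstance:
    """Hypothetical multi-candidate entry: one complex word in its context."""
    sentence: str
    complex_word: str
    suggestions: List[str]  # raw annotator suggestions, repeats allowed

    def ranked_candidates(self) -> List[Tuple[str, int]]:
        # Order substitutes by how many annotators proposed them;
        # ties are broken alphabetically so the order is deterministic.
        counts = Counter(s.lower() for s in self.suggestions)
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def hit_at_k(predicted: List[str], instance: LSInstance, k: int) -> bool:
    """True if any of the top-k predicted substitutes is a gold candidate."""
    gold = {s.lower() for s in instance.suggestions}
    return any(p.lower() in gold for p in predicted[:k])


# Invented example (not an actual dataset entry).
example = LSInstance(
    sentence="O governo anunciou medidas para mitigar os efeitos da crise.",
    complex_word="mitigar",
    suggestions=["reduzir", "diminuir", "reduzir", "aliviar", "reduzir", "diminuir"],
)

if __name__ == "__main__":
    print(example.ranked_candidates())
    # Pretend a substitute-generation model returned this ranked list.
    print(hit_at_k(["atenuar", "aliviar"], example, k=2))
```

Keeping the raw suggestions rather than a deduplicated set preserves how often each substitute was proposed, which is what allows candidates to be ranked in the first place.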