The HuggingFace Datasets Hub hosts thousands of datasets, opening exciting opportunities for language model training and evaluation. However, datasets for a given type of task are stored with different schemas, and harmonization is harder than it seems (https://xkcd.com/927/). Multi-task training or evaluation therefore requires manual work to fit data into task templates. Various initiatives independently address this problem by releasing harmonized datasets or harmonization code that preprocesses datasets into a consistent format. We identify patterns across previous preprocessing efforts, e.g. mapping column names and extracting a specific sub-field from structured data in a column, and propose a structured annotation framework that keeps our annotations fully exposed rather than buried in unstructured code. We release this annotation framework together with annotations for more than 400 English tasks (https://github.com/sileod/tasksource). These annotations provide metadata, such as the names of the columns to use as inputs or labels for each dataset, and can save time in future dataset preprocessing efforts, even those that do not use our framework. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size on an external evaluation (https://hf.co/sileod/deberta-v3-base-tasksource-nli).
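To make the idea of structured annotations concrete, here is a minimal, hypothetical sketch of a declarative column-mapping annotation in Python. The class name and fields are illustrative assumptions, not the actual tasksource API: the point is that the mapping from dataset-specific columns to a shared task template is stated as inspectable metadata rather than buried in ad-hoc preprocessing code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassificationAnnotation:
    """Hypothetical annotation: maps dataset-specific column names
    to a shared (sentence1, sentence2, labels) classification template."""
    sentence1: str                   # name of the first input column
    sentence2: Optional[str] = None  # optional second input column (e.g. NLI hypothesis)
    labels: str = "label"            # name of the label column

    def apply(self, example: dict) -> dict:
        # Rename the dataset's columns into the shared template fields.
        out = {"sentence1": example[self.sentence1],
               "labels": example[self.labels]}
        if self.sentence2 is not None:
            out["sentence2"] = example[self.sentence2]
        return out

# Example: an NLI dataset that stores its inputs as "premise"/"hypothesis".
anno = ClassificationAnnotation(sentence1="premise", sentence2="hypothesis")
raw = {"premise": "A man eats.", "hypothesis": "Someone eats.", "label": 0}
print(anno.apply(raw))
# → {'sentence1': 'A man eats.', 'labels': 0, 'sentence2': 'Someone eats.'}
```

Because the annotation is plain data, it can be listed, audited, or reused by other preprocessing pipelines (e.g. passed to a `datasets.Dataset.map` call) without parsing any harmonization code.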