Automatic extraction of product attributes from textual descriptions is essential to the online shopping experience. One inherent challenge of this task is the emerging nature of e-commerce products: new types of products constantly appear, each with its own set of new attributes. Most prior work mines new values for a set of known attributes but cannot handle new attributes that arise from constantly changing data. In this work, we study the attribute mining problem in an open-world setting, extracting both novel attributes and their values. Instead of providing comprehensive training data, the user only needs to provide a few examples for a few known attribute types as weak supervision. We propose a principled framework that first generates attribute value candidates and then groups them into clusters of attributes. The candidate generation step probes a pre-trained language model to extract phrases from product titles. Then, an attribute-aware fine-tuning method optimizes a multitask objective and shapes the language model representation to be attribute-discriminative. Finally, we discover new attributes and values through the self-ensemble of our framework, which handles the open-world challenge. We run extensive experiments on a large distantly annotated development set and a gold-standard human-annotated test set that we collected. Our model significantly outperforms strong baselines and generalizes to unseen attributes and product types.