From Patents to Dataset: Scraping for Oxide Glass Compositions and Properties

Gustavo Laranja Thomaello, Thomaz Yeiden Busnardo Aguena, Eric Trevelato Costa, Rafael Baságlia Rosante, Thiago Rodrigo Ramos, Daiane Aparecida Zuanetti, Edgar Dutra Zanotto

In this work, we present web scraping techniques that extract information from patent tables and then clean and structure it for future use in predictive machine learning models for developing new glasses. We extracted compositions and three properties relevant to glass development (liquidus temperature, refractive index, and Abbe number) and structured them into a database to be used together with information from other available datasets. We also analyzed the consistency of the information obtained and what it adds to the existing databases. The extracted data comprise 5,696 compositions with liquidus temperatures, 4,298 refractive index values, and 1,771 compositions with Abbe numbers. This extraction increases the available information by approximately 10.4% for liquidus temperature, 6.6% for refractive index, and 4.9% for Abbe number. The impact extends beyond quantity: the newly extracted data introduce compositions with property values that are more diverse than those in existing databases, thereby expanding the compositional and property space accessible to glass modeling applications. In particular, the compositions in the new database contain relatively more titanium, magnesium, zirconium, niobium, iron, tin, and yttrium oxides than those in the existing databases.
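
The abstract does not describe the extraction code itself, so the following is only a minimal sketch of the general idea (parse a patent-style table, separate composition rows from property rows, then run a basic consistency check). It is not the authors' pipeline. The table layout (components as rows, examples as columns), the sample HTML, the label mapping `LABELS`, and the helper names `TableParser`, `extract_examples`, and `sums_to_100` are all assumptions made for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_float(text):
    """Return a float, or None for blanks, dashes and other non-numeric cells."""
    try:
        return float(text)
    except ValueError:
        return None


def extract_examples(html, property_labels):
    """Turn a components-by-examples table into one record per example column."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    examples = [
        {"example": name, "composition": {}, "properties": {}}
        for name in header[1:]
    ]
    for row in body:
        if not row:
            continue
        label, cells = row[0], row[1:]
        for example, cell in zip(examples, cells):
            value = to_float(cell)
            if value is None:
                continue  # cleaning step: skip empty or non-numeric cells
            if label in property_labels:
                example["properties"][property_labels[label]] = value
            else:
                example["composition"][label] = value
    return examples


def sums_to_100(composition, tol=0.5):
    """Consistency check: component percentages should add up to about 100."""
    return abs(sum(composition.values()) - 100.0) <= tol


# Hypothetical table in the style of a patent example table (invented values).
SAMPLE = """
<table>
  <tr><th>Component (mol%)</th><th>Ex. 1</th><th>Ex. 2</th></tr>
  <tr><td>SiO2</td><td>60.0</td><td>55.0</td></tr>
  <tr><td>TiO2</td><td>10.0</td><td>-</td></tr>
  <tr><td>Nb2O5</td><td>30.0</td><td>45.0</td></tr>
  <tr><td>Liquidus temperature (C)</td><td>1150</td><td>1210</td></tr>
  <tr><td>nd</td><td>1.85</td><td>1.91</td></tr>
</table>
"""

# Assumed mapping from row labels to property names; real patents vary widely.
LABELS = {
    "Liquidus temperature (C)": "liquidus_c",
    "nd": "refractive_index",
    "vd": "abbe_number",
}

records = extract_examples(SAMPLE, LABELS)
for record in records:
    print(record["example"], record["composition"], record["properties"],
          sums_to_100(record["composition"]))
```

Keeping one record per example column makes it straightforward to filter by property afterwards, for instance to build separate subsets for liquidus temperature, refractive index, and Abbe number. Real patent tables would need far more normalization (units, merged cells, weight versus mole percent) than this sketch attempts.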
