Journal of Zhejiang University (Engineering Science)  2022, Vol. 56 Issue (2): 245-253    DOI: 10.3785/j.issn.1008-973X.2022.02.004
Computer and Control Engineering
Construction method of extraction dataset of Al-Si alloy entity relationship
Ying-li LIU1,2, Rui-gang WU1,2, Chang-hui YAO1,2, Tao SHEN1,2,*
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China
Abstract:

No public dataset in the materials field is suitable for research on material entity relation extraction. To address this gap, a method for constructing an aluminum-silicon (Al-Si) alloy entity relation extraction dataset was proposed, based on a study of the literature on spray deposition of high-silicon aluminum alloys. Construction standards for the dataset were formulated under the guidance of experts in the materials field, and the collected data were annotated with entities and relations according to these standards. After annotation was completed, the dataset was generated through data preprocessing. Experiments with a joint entity and relation extraction model verified that the dataset can be applied to entity relation extraction tasks. Compared with public datasets, the sentences in the materials dataset have more complex semantics and grammar, and more of them are long, so the joint extraction model performed slightly worse on the materials dataset. To address this, a self-attention mechanism was added to the joint extraction model, which raised the overall F1 score of the model by about 5.8 percentage points. The construction method is general and can be used to build other materials datasets.

Key words: dataset; construction standard; data annotation; entity relationship joint extraction model; self-attention mechanism
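
The abstract does not say how the self-attention layer is configured, so the following is only a minimal sketch of generic scaled dot-product self-attention over the token representations of one sentence. The dimensions, the random projection matrices, and the names `self_attention`, `w_q`, `w_k`, and `w_v` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(h, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sentence.

    h: (seq_len, d_model) token representations, e.g. encoder outputs.
    w_q, w_k, w_v: (d_model, d_attn) projection matrices (assumed shapes).
    Returns the context vectors and the attention weights.
    """
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    # Each token scores every other token, so distant words in a long
    # sentence can influence each other directly.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy input: 6 tokens with 8-dimensional representations (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_attn = 6, 8, 4
h = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_attn)) for _ in range(3))
context, weights = self_attention(h, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and describes how strongly one token attends to every token in the sentence, which is one plausible reason such a layer helps on the long sentences the abstract describes.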
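Received: 2021-07-12    Published: 2022-03-03
CLC:  TP 391.1
Funding: National Natural Science Foundation of China (52061020, 61971208, 51864027); Open Fund of the Yunnan Key Laboratory of Computer Technologies Application (2020103)
Corresponding author: Tao SHEN     E-mail: lyl2002@126.com; shentao@kust.edu.cn
About the author: Ying-li LIU (born 1978), female, associate professor, works on machine learning, natural language processing, and the Materials Genome Initiative. orcid.org/0000-0003-0298-9257. E-mail: lyl2002@126.com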

Cite this article:

Ying-li LIU, Rui-gang WU, Chang-hui YAO, Tao SHEN. Construction method of extraction dataset of Al-Si alloy entity relationship. Journal of ZheJiang University (Engineering Science), 2022, 56(2): 245-253.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2022.02.004
https://www.zjujournals.com/eng/CN/Y2022/V56/I2/245

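Dataset | Ne | Nr
CoNLL-2004 | 4 | 5
ACE04 | 7 | 7
ADE | 2 | 1
DERC | 9 | 2
Table 1  Comparison of the numbers of entity types (Ne) and relation types (Nr) in public datasets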
Fig. 1  Overall framework of the Al-Si alloy entity relation extraction dataset
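Property | Keyword | Full English name | Abbreviation
Tensile | Tensile strength | tensile strength | UTS/Rm/σb
Tensile | Elongation | elongation | EL/δ
Hardness | Vickers hardness | Vickers hardness | HV
Hardness | Brinell hardness | Brinell hardness | HB
Hardness | Rockwell hardness | Rockwell hardness | HR
Microstructure | Transmission | transmission electron microscope | TEM
Microstructure | Scanning | scanning electron microscopy | SEM
Microstructure | Optical microscope | optical microscope | OM
Microstructure | Electron backscatter diffraction | electron back scattering diffraction | EBSD
Coefficient of thermal expansion | - | coefficient of thermal expansion | CTE
Table 2  Selected properties of main interest for Al-Si alloys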
Fig. 2  Structure of entity types and relation types
Fig. 3  Brat manual annotation interface
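Entity | English label | Annotation tag
Element | Element | Ele
Content | Content | Con
Alloy | Alloy | Alloy
Experiment | Experiment | Exp
Experiment result | Experiment_result | Exp_r
Parameter name | Parameter_n | Par_n
Parameter value | Parameter_v | Par_v
Test name | Test_n | Test_n
Test value | Test_v | Test_v
Test figure | Test_f | Test_f
Phase | Phase | Phase
Table 3  Annotation format of entities in Brat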
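Relation type | Entity 1 | Entity 2 | English label | Annotation tag
Composition | Content | Element | Content-Element | Con-Ele
Composition | Element | Alloy | Element-Alloy | Ele-Alloy
Experiment | Alloy | Experiment | Alloy-Experiment | Alloy-Exp
Experiment | Experiment | Experiment result | Experiment-Experiment_result | Exp-Exp_r
Experiment | Experiment result | Parameter name | Experiment_result-Parameter_n | Exp_r-Par_n
Experiment | Experiment | Parameter name | Experiment-Parameter_n | Exp-Par_n
Test | Alloy | Test name | Alloy-Test_n | Alloy-Test_n
Test | Test name | Parameter name | Test_n-Parameter_n | Test_n-Par_n
Test | Test name | Test value | Test_n-Test_v | Test_n-Test_v
Test | Test name | Test figure | Test_n-Test_f | Test_n-Test_f
Test | Test name | Phase | Test_n-Phase | Test_n-Phase
Test | Phase | Test value | Phase-Test_v | Phase-Test_v
Parameter | Parameter name | Parameter value | Parameter_n-Parameter_v | Par_n-Par_v
Table 4  Annotation format of relations in Brat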
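
Brat stores annotations in a standoff `.ann` file next to the source text: entity lines start with `T`, relation lines with `R`, and fields are tab-separated. As a rough sketch of how such a file could be read back using the tags from the tables above (`Con`, `Ele`, `Con-Ele`), the parser below handles only these two line types. The `Entity` and `Relation` containers, the function name `parse_ann`, and the sample annotation are illustrative assumptions, not the authors' preprocessing code.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


class Relation(NamedTuple):
    id: str
    label: str
    arg1: str  # id of the first entity
    arg2: str  # id of the second entity


def parse_ann(ann_text: str) -> Tuple[Dict[str, Entity], List[Relation]]:
    """Parse entity (T) and relation (R) lines of a Brat standoff file."""
    entities: Dict[str, Entity] = {}
    relations: List[Relation] = []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            label, span = fields[1].split(" ", 1)
            # Discontinuous spans ("0 5;8 12") are collapsed to outer bounds.
            offsets = [int(x) for part in span.split(";") for x in part.split()]
            entities[ann_id] = Entity(ann_id, label, offsets[0], offsets[-1], fields[2])
        elif ann_id.startswith("R"):
            label, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(ann_id, label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Hypothetical annotation of one sentence from the dataset examples.
sentence = ("However, 3% Mn addition leads to a substantial improvement "
            "of the tensile strengths at elevated temperature.")
ann = "T1\tCon 9 11\t3%\nT2\tEle 12 14\tMn\nR1\tCon-Ele Arg1:T1 Arg2:T2\n"
entities, relations = parse_ann(ann)
```

Keeping character offsets alongside the surface text makes it easy to check that each annotation still matches the sentence after any later cleaning step.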
Fig. 4  Dataset format read by the relation extraction model
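
The figure itself is not reproduced in this text, so the exact column layout is unknown here. The sketch below assumes a token-level layout of the kind commonly used with multi-head selection models: one row per token with a BIO tag, a list of relation labels, and a list of head token indices, where a relation is attached to the last token of the first entity and points to the last token of the second. The function name `to_token_rows`, the `"N"` placeholder for "no relation", and the token boundaries in the example are all assumptions.

```python
def to_token_rows(tokens, entities, relations):
    """Convert one annotated sentence to token-level rows.

    tokens: list of str.
    entities: list of (first_token, last_token, label), indices inclusive.
    relations: list of (label, source_entity_index, target_entity_index).
    Returns rows of (index, token, BIO tag, relation labels, head indices).
    """
    tags = ["O"] * len(tokens)
    for first, last, label in entities:
        tags[first] = "B-" + label
        for i in range(first + 1, last + 1):
            tags[i] = "I-" + label

    rels = [[] for _ in tokens]
    heads = [[] for _ in tokens]
    for label, src, dst in relations:
        src_last = entities[src][1]
        dst_last = entities[dst][1]
        rels[src_last].append(label)
        heads[src_last].append(dst_last)

    rows = []
    for i, tok in enumerate(tokens):
        if rels[i]:
            rows.append((i, tok, tags[i], rels[i], heads[i]))
        else:
            # Tokens without a relation point to themselves with label "N".
            rows.append((i, tok, tags[i], ["N"], [i]))
    return rows


tokens = ["The", "testing", "temperatures", "were", "298", ",",
          "473", "and", "573", "K", "."]
entities = [(1, 2, "Par_n"), (4, 4, "Par_v"), (6, 6, "Par_v"), (8, 9, "Par_v")]
relations = [("Par_n-Par_v", 0, 1), ("Par_n-Par_v", 0, 2), ("Par_n-Par_v", 0, 3)]
rows = to_token_rows(tokens, entities, relations)
```

Because one token can carry several relation labels and heads, a single parameter name can be linked to all three parameter values in the example sentence.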
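Dataset | Ns | Ne | Nr
CoNLL-2004 | 1441 | 5347 | 2020
Al-Si alloy relation extraction dataset | 2246 | 2522 | 1510
Table 5  Comparison of dataset sizes (Ns, Ne, Nr: numbers of sentences, entities, and relations)
Fig. 5  Comparison of sentence lengths in the datasets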
Entity type | Example
Content | However, 3% Mn addition leads to a substantial improvement of the tensile strengths at elevated temperature.
Element | However, 3% Mn addition leads to a substantial improvement of the tensile strengths at elevated temperature.
Alloy | In this paper, Al-20Si-3Cu-1Mg alloy was prepared by spray deposition technique.
Experiment | In this paper, Al-20Si-3Cu-1Mg alloy was prepared by spray deposition technique.
Experiment result | Cylindrical samples of 25 mm in diameter and 50 mm in length were machined out from each spray deposit obtained.
Test name | Fig.1(c) shows the scanning electron microscopy microstructure of the spray-deposited A2 alloy.
Test value | From table 2, with additions of 5% Fe to Al-20Si-3Cu-1Mg alloy, both the yield and ultimate tensile strengths was increased 55 and 64 MPa at room temperature.
Test figure | Fig.1(c) shows the scanning electron microscopy microstructure of the spray-deposited A2 alloy.
Phase | The formation of b-Al5FeSi phase in this transformation is known to be very slow.
Parameter name | The testing temperatures were 298, 473 and 573 K.
Parameter value | The testing temperatures were 298, 473 and 573 K.
Table 6  Examples of entity types in the Al-Si alloy relation extraction dataset
Relation type | Example
Composition | 1. (Content-Element) However, 3%(1) Mn(2) addition leads to a substantial improvement of the tensile strengths at elevated temperature.
Composition | 2. (Element-Alloy) Effect of Fe(1) and Mn(1) additions on microstructure and mechanical properties of spray-deposition Al-20Si-3Cu-1Mg alloy(2).
Experiment | 1. (Alloy-Experiment) In this paper, Al-20Si-3Cu-1Mg alloy(1) was prepared by spray deposition(2) technique.
Experiment | 2. (Experiment-Experiment result) Cylindrical samples(2) of 25 mm in diameter and 50 mm in length were machined out from each spray deposit obtained(1).
Experiment | 3. (Experiment-Parameter name) The solidification process(1) during spray-deposition occurs in two stages: gas atomization (rapid cooling)(2) and droplet consolidation (relatively slow cooling)(2).
Experiment | 4. (Experiment result-Parameter name) The starting carbon particle(2) were as large as 50 mm, while the TiC particles are less than 0.7 mm in reacted preforms(1).
Test | 1. (Alloy-Test name) The UTSs(2) of the TiC/Al(1) composites were improved over that of the unreinforced Al matrix
Test | 2. (Test name-Parameter name) As shown in Table 2, the UTS(1) of the TiC/Al composites at room temperature(2) was improved over that of the unreinforced Al matrix.
Test | 3. (Test name-Test value) From table 2, with additions of 5% Fe to Al-20Si-3Cu-1Mg alloy, both the yield(1) and ultimate tensile strengths(1) was increased 55(2) and 64 MPa(2) at room temperature.
Test | 4. (Test name-Test figure) Fig.1(c)(2) shows the scanning electron microscopy microstructure(1) of the spray-deposited A2 alloy.
Test | 5. (Test name-Phase) The high volume fraction of metastable d-Al4FeSi2 phase(2) in the spray-deposited microstructure(1) may be attributed to two primary reasons.
Test | 6. (Phase-Test value) The microstructure of the as-deposited alloy is composed of primary Si(1) with an average size of 12.5 μm(2) and secondary Al phase.
Parameter | 1. (Parameter name-Parameter value) The testing temperatures(1) were 298(2), 473(2) and 573 K(2).
Table 7  Examples of relation types in the Al-Si alloy relation extraction dataset (markers (1) and (2) denote entity 1 and entity 2 of the relation)
Package | Version
CUDA | 10.2
cuDNN | 7.6.5
Python | 3.6.12
TensorFlow | 1.15.0
gensim | 3.4.0
numpy | 1.19.4
sklearn | 0.22
prettytable | 0.7.0
pandas | 0.24.2
Table 8  Experimental environment and package versions
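Dataset | NER Pr | NER Re | NER F1 | RE Pr | RE Re | RE F1 | Overall F1
Dataset of this study | 66.2 | 61.7 | 63.9 | 53.5 | 44.2 | 48.5 | 56.2
CoNLL-2004 | 67.7 | 68.7 | 68.2 | 54.1 | 47.8 | 54.8 | 61.5
Table 9  Comparison of experimental results on the two datasets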
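Model | NER Pr | NER Re | NER F1 | RE Pr | RE Re | RE F1 | Overall F1
att_Multi_head | 71.3 | 64.4 | 67.7 | 65.2 | 49.5 | 56.3 | 62.0
Multi-head | 66.2 | 61.7 | 63.9 | 53.5 | 44.2 | 48.5 | 56.2
Difference | +5.1 | +2.7 | +3.8 | +11.7 | +5.3 | +7.8 | +5.8
Table 10  Comparison of experimental results of the Multi-head and att_Multi_head models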
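
The overall F1 values in this table agree with the arithmetic mean of the NER and RE F1 scores. Whether the authors computed them this way is an assumption; the short check below only shows that the averaging reproduces the reported overall scores and the 5.8-point gain. The names `overall_f1`, `scores`, and `gain` are illustrative.

```python
def overall_f1(ner_f1: float, re_f1: float) -> float:
    """Assumed definition: mean of the NER and RE F1 scores (in %)."""
    return (ner_f1 + re_f1) / 2


# (NER F1, RE F1) as reported for the two models.
scores = {
    "att_Multi_head": (67.7, 56.3),
    "Multi-head": (63.9, 48.5),
}

overall = {name: round(overall_f1(*f1s), 1) for name, f1s in scores.items()}
gain = round(overall["att_Multi_head"] - overall["Multi-head"], 1)

for name, value in overall.items():
    print(f"{name}: overall F1 = {value}")
print(f"gain from self-attention = {gain} points")
```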
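Entity | TP | FP | FN | Pr/% | Re/% | F1/%
Con (Content) | 15 | 0 | 0 | 100.00 | 100.00 | 100.00
Ele (Element) | 18 | 5 | 3 | 78.26 | 85.71 | 81.81
Alloy (Alloy) | 37 | 8 | 10 | 82.22 | 78.72 | 80.43
Exp (Experiment) | 36 | 13 | 13 | 73.46 | 73.46 | 73.46
Exp_r (Experiment result) | 0 | 0 | 1 | 0 | 0 | 0
Test_n (Test name) | 52 | 23 | 29 | 69.33 | 64.19 | 66.66
Test_v (Test value) | 6 | 7 | 12 | 46.15 | 33.33 | 38.70
Test_f (Test figure) | 18 | 9 | 9 | 66.66 | 66.66 | 66.66
Phase (Phase) | 31 | 9 | 14 | 77.50 | 68.88 | 72.94
Par_n (Parameter name) | 13 | 10 | 23 | 56.52 | 36.11 | 44.06
Par_v (Parameter value) | 13 | 11 | 18 | 54.16 | 41.93 | 47.27
Total | 239 | 95 | 132 | 71.34 | 64.44 | 67.71
Table 11  Experimental results for each entity type in the NER task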
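
The precision, recall, and F1 columns follow from the TP, FP, and FN counts by the standard definitions. A minimal sketch of that calculation is given below, using the Ele row as the worked example; the function name `prf` and the choice to return 0 when a denominator is empty (as in the Exp_r row) are assumptions of this sketch.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 as fractions from raw counts.

    Returns 0.0 for any score whose denominator would be zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Ele (Element) row: TP = 18, FP = 5, FN = 3.
p, r, f1 = prf(tp=18, fp=5, fn=3)
print(f"Pr = {100 * p:.2f}%, Re = {100 * r:.2f}%, F1 = {100 * f1:.2f}%")
```

The printed values agree with the Ele row up to rounding in the last digit.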
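Relation | TP | FP | FN | Pr/% | Re/% | F1/%
composition (Composition) | 22 | 7 | 9 | 75.86 | 70.96 | 73.33
experiment (Experiment) | 29 | 12 | 17 | 70.73 | 63.04 | 66.66
test (Test) | 50 | 30 | 63 | 62.50 | 44.24 | 51.81
parameter (Parameter) | 8 | 9 | 23 | 54.16 | 41.93 | 47.27
Total | 109 | 58 | 112 | 65.27 | 49.55 | 56.33
Table 12  Experimental results for each relation type in the RE task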