Journal of Zhejiang University (Engineering Science)
Mechanical Engineering
Quality evaluation algorithm for conflicting data sources based on true value finding
WANG Ji-kui1, LI Shao-bo1,2
1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China; 2. Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550003, China
Abstract:

Existing quality evaluation models for conflicting data sources usually consider only accuracy and precision, ignoring the fact that a false description and an empty value affect the quality of a data source differently. In this paper, false descriptions provided by a data source are defined as active errors, and the absence of a description for an entity is defined as a passive error; a data source quality model is built on these two kinds of error. In the model, sensitivity and specificity replace accuracy and precision, respectively. To handle the multi-truth problem, the descriptions that the data sources give for an entity are merged in advance, and an inclusion relation between merged descriptions, together with a model for calculating inclusion degrees, is defined. Based on this calculation model, a quality evaluation algorithm for conflicting data sources based on description inclusion degree (TFDQ) is proposed. Experiments on the widely used Books-Authors data set show that the results of TFDQ are closer to the real situation than those of the Vote and TruthFinder algorithms.
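The distinction between the two kinds of error can be illustrated with a small sketch. It is not the TFDQ algorithm and does not reproduce the paper's sensitivity, specificity, or inclusion-degree formulas, which the abstract does not state. The function names `vote` and `count_errors`, the toy claims, and the single-valued truth table are all assumptions made for illustration; `vote` is a plain majority vote, assumed here to resemble the Vote baseline.

```python
from collections import Counter


def vote(claims):
    """Majority vote: for each entity, pick the value claimed by most sources."""
    tallies = {}
    for source_claims in claims.values():
        for entity, value in source_claims.items():
            if value is not None:
                tallies.setdefault(entity, Counter())[value] += 1
    return {entity: counter.most_common(1)[0][0]
            for entity, counter in tallies.items()}


def count_errors(claims, truth):
    """Count, per source, active errors (a wrong value was given)
    and passive errors (no value was given for the entity)."""
    report = {}
    for source, source_claims in claims.items():
        active = 0
        passive = 0
        for entity, true_value in truth.items():
            value = source_claims.get(entity)
            if value is None:
                passive += 1   # the source is silent about this entity
            elif value != true_value:
                active += 1    # the source asserts something false
        report[source] = {"active": active, "passive": passive}
    return report


# Hypothetical toy data: three sources describing two books.
claims = {
    "s1": {"book1": "A", "book2": "B"},
    "s2": {"book1": "A", "book2": "C"},  # wrong value for book2
    "s3": {"book2": "B"},                # nothing for book1
}
truth = {"book1": "A", "book2": "B"}

voted = vote(claims)
errors = count_errors(claims, truth)
```

In this toy data, `s2` and `s3` each miss one entity, but `s2` commits an active error while `s3` commits a passive one. A model that scores sources only by the share of correct values would treat them alike, which is the gap the abstract describes.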

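Published: 2015-02-01
CLC number: TP 311
Funding:

National Natural Science Foundation of China (51475097); National Science and Technology Support Program of the 12th Five-Year Plan (2012BAF12B14); Guizhou Provincial Science and Technology Projects (黔科合JZ字[2014]2001, 黔科合计Z字[2012]4009)

Corresponding author: LI Shao-bo, professor, doctoral supervisor. E-mail: lishaobo@gzu.edu.cn
About the author: WANG Ji-kui (born 1978), male, associate professor. Research interests: data governance, data integration, and software process technology and methods. E-mail: wjkweb@163.com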
Cite this article:

WANG Ji-kui, LI Shao-bo. Quality evaluation algorithm for conflicting data sources based on true value finding. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 10.3785/j.issn.1008-973X.2015.02.016.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2015.02.016
http://www.zjujournals.com/eng/CN/Y2015/V49/I2/303
