JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE)
    
Quality evaluation algorithm for conflicting data sources based on true value finding
WANG Ji-kui¹, LI Shao-bo¹˒²
1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China; 2. Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550003, China

Abstract  

Existing quality evaluation models for conflicting data sources usually consider only accuracy and precision, ignoring the different impacts that false values and empty values have on data source quality. In this paper, a false description provided by a data source is defined as an initiative error, and an empty value, where the source provides no description for an entity, is defined as a passive error. A new quality evaluation model is built on these two kinds of error, with sensitivity and specificity replacing accuracy and precision. To handle the multi-truth problem, the descriptions provided by data sources are merged in advance, and an inclusion relation between merged descriptions is defined together with a model for calculating inclusion degrees. Based on this calculation model, TFDQ, an algorithm for evaluating the quality of conflicting data sources, is proposed. Experiments on the commonly used Books-Authors data set show that the results of TFDQ are closer to the actual situation than those of the classic Vote and TruthFinder algorithms.
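The distinction between initiative and passive errors can be illustrated with a minimal sketch. The toy claims, the function name `tally_errors`, and the assumption that the true values are already known are all hypothetical and serve only to show the two error types; this is not the TFDQ quality model itself.

```python
from collections import Counter

def tally_errors(claims, truth):
    """Count correct values, initiative errors, and passive errors per source.

    claims: {source: {entity: value or None}}; None or a missing entity
            means the source provided no description (an empty value).
    truth:  {entity: true value}, assumed known for this illustration.
    """
    tallies = {}
    for source, described in claims.items():
        counts = Counter()
        for entity, true_value in truth.items():
            value = described.get(entity)
            if value is None:
                counts["passive"] += 1      # no description was provided
            elif value == true_value:
                counts["correct"] += 1
            else:
                counts["initiative"] += 1   # a false description was provided
        tallies[source] = counts
    return tallies

# Hypothetical toy data: two sources describing three books.
truth = {"book1": "A", "book2": "B", "book3": "C"}
claims = {
    "s1": {"book1": "A", "book2": "X", "book3": "C"},
    "s2": {"book1": "A", "book2": None},
}
tallies = tally_errors(claims, truth)
```

Here source `s1` makes one initiative error and no passive errors, while `s2` makes no initiative errors and two passive errors, so a model that treats the two error types differently can rank these sources differently.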



Published: 01 February 2015
CLC:  TP 311  
Cite this article:

WANG Ji-kui, LI Shao-bo. Quality evaluation algorithm for conflicting data sources based on true value finding. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(2): 303-318.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2015.02.016     OR     http://www.zjujournals.com/eng/Y2015/V49/I2/303


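Quality evaluation algorithm for conflicting data sources based on truth discovery

Current quality evaluation models for conflicting data sources consider only accuracy and precision, and do not account for the fact that providing a false description and providing an empty value affect the quality of a data source differently. A false description provided by a data source is defined as an initiative error, and the absence of a description for an entity is defined as a passive error; a data source quality model is then built from these two kinds of error. The model replaces accuracy and precision with sensitivity and specificity. To handle the multi-truth problem, the descriptions that data sources give for an entity are merged in advance, and an inclusion relation between merged descriptions and a model for calculating inclusion degrees are defined. Based on the inclusion degree calculation model, a quality evaluation algorithm for conflicting data sources based on description inclusion degree (TFDQ) is proposed. Experiments on the commonly used Books-Authors data set show that, compared with the Vote and TruthFinder algorithms, the results of TFDQ are closer to the actual situation.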
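As a rough illustration of merging descriptions and comparing them by inclusion degree, the sketch below treats a merged description as a set of values and uses the ratio |A ∩ B| / |A| as the inclusion degree of A in B. Both the set representation and this ratio are assumptions chosen for illustration; the abstract does not give the paper's actual calculation model, and the names `merge_descriptions` and `inclusion_degree` are hypothetical.

```python
def merge_descriptions(rows):
    """Merge (source, entity, value) rows into one value set per (source, entity)."""
    merged = {}
    for source, entity, value in rows:
        merged.setdefault((source, entity), set()).add(value)
    return merged

def inclusion_degree(a, b):
    """Fraction of the values in description `a` that also appear in `b`.

    Assumed definition: |a & b| / |a|. An empty `a` returns 1.0, following
    the convention that the empty set is a subset of every set.
    """
    if not a:
        return 1.0
    return len(a & b) / len(a)

# Hypothetical toy rows: two sources listing authors for the same book.
rows = [
    ("s1", "book1", "Ann Lee"),
    ("s1", "book1", "Bob Ray"),
    ("s2", "book1", "Ann Lee"),
]
merged = merge_descriptions(rows)
d_s2_in_s1 = inclusion_degree(merged[("s2", "book1")], merged[("s1", "book1")])
d_s1_in_s2 = inclusion_degree(merged[("s1", "book1")], merged[("s2", "book1")])
```

In this toy case the description from `s2` is fully included in that of `s1`, while only half of the `s1` description is included in that of `s2`, which is the kind of asymmetry an inclusion relation is meant to capture when an entity has several true values.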
