JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE)
Computer Technology     
Bayesian conflicting Web data credibility algorithm
WANG Ji-kui
Key Laboratory of Electronic Commerce, Lanzhou University of Finance and Economics, Lanzhou 730000, China

Abstract  

The key to Web data fusion is judging the credibility of conflicting Web data. Existing algorithms rectify the credibility of observed values using only the similarity relations among them and ignore the inclusion relations. To solve this problem, the concept of inclusion degree was defined based on an analysis of the characteristics of the observed values, and a model was proposed that uses inclusion degree to rectify the credibility of observed values. Taking the observed values as random variables, the credibility problem of the observed values can be reduced to a posterior probability distribution problem. On the basis of Bayesian analysis, data source credibility was defined, a model relating data source credibility to observed value credibility was derived, and the Bayesian conflicting Web data credibility algorithm DataCredibility was proposed. The experimental results show that DataCredibility achieves higher precision, recall, and F1 measure than the baseline algorithms.
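
The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the DataCredibility algorithm itself. It assumes a token-overlap definition of inclusion degree, a log-odds vote per source, a uniform prior over candidate values, and source credibility taken as the mean posterior of the values a source asserts. The names `inclusion_degree` and `data_credibility_sketch` and the parameters `rho`, `init`, and `n_false` are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict


def inclusion_degree(a: str, b: str) -> float:
    """Fraction of a's tokens that also occur in b (assumed definition)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta) if ta else 0.0


def data_credibility_sketch(claims, rounds=10, rho=0.5, init=0.8, n_false=10):
    """Iteratively estimate value and source credibility.

    claims: {item: {source: observed_value}}
    Returns (value_cred, source_cred), where value_cred[item][value] is a
    posterior probability and source_cred[source] lies in [0, 1].
    """
    sources = {s for by_src in claims.values() for s in by_src}
    source_cred = {s: init for s in sources}
    value_cred = {}
    for _ in range(rounds):
        for item, by_src in claims.items():
            # Vote score of each value: summed log-odds of its sources.
            votes = defaultdict(float)
            for s, v in by_src.items():
                a = min(max(source_cred[s], 0.01), 0.99)  # avoid log(0)
                votes[v] += math.log(n_false * a / (1 - a))
            # Rectification (assumed form): a value gains support from
            # other values in proportion to how much they include it.
            adjusted = {
                v: votes[v] + rho * sum(
                    inclusion_degree(v, w) * votes[w] for w in votes if w != v
                )
                for v in votes
            }
            # Posterior over candidate values under a uniform prior.
            top = max(adjusted.values())
            z = sum(math.exp(x - top) for x in adjusted.values())
            value_cred[item] = {
                v: math.exp(x - top) / z for v, x in adjusted.items()
            }
        # Source credibility: mean posterior of the values it asserts.
        for s in sources:
            probs = [
                value_cred[i][by_src[s]]
                for i, by_src in claims.items()
                if s in by_src
            ]
            source_cred[s] = sum(probs) / len(probs)
    return value_cred, source_cred
```

Calling `data_credibility_sketch` on a small dictionary of claims returns, for each data item, a probability distribution over its conflicting observed values, together with one credibility score per source. The weight `rho` controls how strongly inclusion relations rectify the raw votes; setting it to 0 disables the rectification.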



Published: 08 December 2016
CLC:  TP 311  
Cite this article:

WANG Ji-kui. Bayesian conflicting Web data credibility algorithm. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2016, 50(12): 2380-2385.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2016.12.019     OR     http://www.zjujournals.com/eng/Y2016/V50/I12/2380



