Journal of Zhejiang University (Engineering Science)
Computer Science
Bayesian conflicting Web data credibility algorithm
WANG Ji-kui
Key Laboratory of Electronic Commerce, Lanzhou University of Finance and Economics, Lanzhou 730000, China
Abstract:

Judging the credibility of conflicting Web data is the key to Web data fusion. Existing algorithms correct the credibility of observed values using only the similarity relations among them and ignore inclusion relations. To address this problem, the inclusion degree of observed values is defined on the basis of an analysis of the characteristics of the observed values themselves, and a model is proposed that uses the inclusion degree to correct the credibility of observed values. Treating observed values as random variables reduces the credibility problem to the problem of finding their posterior probability distribution. On the basis of Bayesian analysis, data source credibility is defined, a model relating data source credibility to observed value credibility is derived, and a Bayesian credibility algorithm for conflicting Web data, DataCredibility, is proposed. Experimental results show that DataCredibility achieves higher precision, recall, and F1 measure than the baseline algorithms.
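
The abstract gives no formulas, so the following Python sketch is only a generic illustration of the ideas it names, not the paper's DataCredibility algorithm. Everything specific here is an assumption made for illustration: the token-overlap definition in `inclusion_degree`, the naive-Bayes likelihood with independent sources and a uniform prior, the weight `rho`, and the mean-based update of source credibility.

```python
def inclusion_degree(a, b):
    """Hypothetical inclusion degree: share of a's tokens that also occur in b."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta) if ta else 0.0


def value_posteriors(claims, source_cred):
    """Posterior over the candidate values of one data item.

    Assumes a uniform prior and independent sources: a source with
    credibility c reports the true value with probability c and each
    other candidate value with probability (1 - c) / n_wrong.
    """
    values = sorted(set(claims.values()))
    n_wrong = max(len(values) - 1, 1)
    scores = {}
    for v in values:
        likelihood = 1.0
        for source, claimed in claims.items():
            c = source_cred[source]
            likelihood *= c if claimed == v else (1.0 - c) / n_wrong
        scores[v] = likelihood
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}


def adjust_by_inclusion(post, rho=0.5):
    """Let a value borrow support from values that contain it, then renormalize."""
    adjusted = {
        v: p + rho * sum(post[u] * inclusion_degree(v, u) for u in post if u != v)
        for v, p in post.items()
    }
    total = sum(adjusted.values())
    return {v: a / total for v, a in adjusted.items()}


def run(observations, rounds=10, init_cred=0.8):
    """Alternate between value credibility and source credibility estimates."""
    sources = {s for claims in observations.values() for s in claims}
    cred = {s: init_cred for s in sources}
    value_cred = {}
    for _ in range(rounds):
        value_cred = {
            item: adjust_by_inclusion(value_posteriors(claims, cred))
            for item, claims in observations.items()
        }
        for s in sources:
            scores = [value_cred[item][claims[s]]
                      for item, claims in observations.items() if s in claims]
            # Clamp so that no likelihood ever collapses to exactly zero.
            cred[s] = min(max(sum(scores) / len(scores), 0.01), 0.99)
    return value_cred, cred


# Toy data: three hypothetical sources reporting the authors of two books.
observations = {
    "book1": {"A": "Jane Doe", "B": "Jane Doe", "C": "John Roe"},
    "book2": {"A": "Ann Lee", "B": "Ann Lee", "C": "Ann Lee Bob Ray"},
}
value_cred, cred = run(observations)
```

In this toy run the value reported by two agreeing sources ends up with the higher credibility, and the dissenting source ends up with the lower source credibility. The inclusion step matters for `book2`, where the shorter author list is fully contained in the longer one and therefore receives part of its support.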

Published: 2016-12-08
CLC number: TP 311
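Funding:

National Natural Science Foundation of China (51475097, 61473194); National Social Science Fund of China (14GSD95); National Statistical Science Research Key Project (2013LZ44); Longyuan Innovative Talents Support Program (14GSD95); Basic Scientific Research Funds for Universities of the Gansu Provincial Department of Finance (GZ14007, GZ14023).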

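About the author: WANG Ji-kui (born 1978), male, associate professor, engaged in research on data governance, data integration, and software process technology and methods. ORCID: 0000-0001-5926-7007. E-mail: wjkweb@163.com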

Cite this article:

WANG Ji-kui. Bayesian conflicting Web data credibility algorithm [J]. Journal of Zhejiang University (Engineering Science), DOI: 10.3785/j.issn.1008-973X.2016.12.019.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2016.12.019
http://www.zjujournals.com/eng/CN/Y2016/V50/I12/2380

