J4  2013, Vol. 47 Issue (1): 23-28    DOI: 10.3785/j.issn.1008-973X.2013.01.004
Computer Technology, Telecommunications Technology
Approach for collection selection based on click-through data
LIU Ying1, CHEN Ling1, CHEN Gen-cai1, ZHAO Jiang-qi2, WANG Jing-chang2
1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China; 2. Zhejiang Hongcheng Computer Systems Company Limited, Hangzhou 310009, China
Abstract:

In distributed information retrieval, collections contribute unequally to the final retrieval results. To address this, an approach for collection selection based on past click-through data (PCTD-CS) was proposed. Click-through data were used to estimate the relevance of each collection to past queries. The similarity between queries was estimated by combining a term-based measure with a results-based measure. The relevance of each collection to a new query was then predicted from similar past queries, the M collections with the highest relevance were selected for retrieval, and the number of documents each collection should return was determined for a request for the top k results. Recall (Rm), precision at the top n results (P@n), and mean average precision (MAP) were used to evaluate the method. Experimental results showed that PCTD-CS improved both the recall and the precision of search results and more accurately identified the collections containing many relevant documents.

Key words: distributed information retrieval    collection selection    similar query    click-through data
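
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so Jaccard overlap for both similarity components, the mixing weight `alpha`, the similarity cut-off `min_sim`, the click-share relevance estimate, and the proportional (largest-remainder) allocation of the k documents are all assumptions made for illustration. The record layout (`terms`, `results`, `clicks`) is likewise hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard overlap of two iterables; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_similarity(q1, q2, alpha=0.5):
    """Mix term overlap with result-list overlap (alpha is an assumed weight)."""
    term_sim = jaccard(q1["terms"], q2["terms"])
    result_sim = jaccard(q1["results"], q2["results"])
    return alpha * term_sim + (1 - alpha) * result_sim


def select_collections(new_query, past_queries, m, k, alpha=0.5, min_sim=0.1):
    """Return {collection: number of documents to request} for the top m collections."""
    scores = defaultdict(float)
    for past in past_queries:
        sim = query_similarity(new_query, past, alpha)
        if sim < min_sim:
            continue  # ignore past queries that are not similar enough
        total_clicks = sum(past["clicks"].values())
        if total_clicks == 0:
            continue
        # Assumed relevance of a collection to a past query: its share of the clicks.
        for coll, clicks in past["clicks"].items():
            scores[coll] += sim * clicks / total_clicks

    # Keep the m highest-scoring collections (ties broken by name).
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:m]
    total = sum(score for _, score in top)
    if total == 0:
        return {}

    # Split k documents in proportion to the scores (largest-remainder rounding).
    quotas = {coll: k * score / total for coll, score in top}
    alloc = {coll: int(quota) for coll, quota in quotas.items()}
    leftover = k - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda c: (-(quotas[c] - alloc[c]), c))
    for coll in by_remainder[:leftover]:
        alloc[coll] += 1
    return alloc
```

With one similar past query whose clicks fall on collections A and B in a 2:1 ratio, a request for the top 10 documents from 2 collections would be split as 7 from A and 3 from B under these assumptions. A real system would replace the click-share estimate and the allocation rule with the ones defined in the paper.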
Published: 2013-03-05
CLC number: TP 311
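Funding:

National "Core Electronic Devices, High-end Generic Chips and Basic Software" Major Science and Technology Project (2010ZX01042-002-003); National Natural Science Foundation of China (60703040); Major Project of the Science and Technology Plan of Zhejiang Province (2007C13019); Major Science and Technology Project of Zhejiang Province (2011C13042); Major Science and Technology Innovation Project of Hangzhou (20112311A20).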

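Corresponding author: CHEN Ling, male, associate professor. E-mail: lingchen@cs.zju.edu.cn
About the author: LIU Ying (1988-), female, master's student, working on distributed information retrieval. E-mail: liuying20711@163.com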
Cite this article:

LIU Ying, CHEN Ling, CHEN Gen-cai, ZHAO Jiang-qi, WANG Jing-chang. Approach for collection selection based on click-through data. J4, 2013, 47(1): 23-28.

Link to this article:

http://www.zjujournals.com/xueshu/eng/CN/10.3785/j.issn.1008-973X.2013.01.004
http://www.zjujournals.com/xueshu/eng/CN/Y2013/V47/I1/23

