Please wait a minute...
Front. Inform. Technol. Electron. Eng.  2016, Vol. 17 Issue (1): 1-14    DOI: 10.1631/FITEE.1500190
Original article     
Efficient dynamic pruning on largest scores first (LSF) retrieval
Kun JIANG(),Yue-xiang YANG
College of Computer, National University of Defense Technology, Changsha 410073, China
Download: HTML     PDF(451KB)
Export: BibTeX | EndNote (RIS)      

Abstract  

Inverted index traversal techniques have been studied in addressing the query processing performance challenges of web search engines, but still leave much room for improvement. In this paper, we focus on the inverted index traversal on document-sorted indexes and the optimization technique called dynamic pruning, which can efficiently reduce the hardware computational resources required. We propose another novel exhaustive index traversal scheme called largest scores first (LSF) retrieval, in which the candidates are first selected in the posting list of important query terms with the largest upper bound scores and then fully scored with the contribution of the remaining query terms. The scheme can effectively reduce the memory consumption of existing term-at-a-time (TAAT) and the candidate selection cost of existing document-at-a-time (DAAT) retrieval at the expense of revisiting the posting lists of the remaining query terms. Preliminary analysis and implementation show comparable performance between LSF and the two well-known baselines. To further reduce the number of postings that need to be revisited, we present efficient rank safe dynamic pruning techniques based on LSF, including two important optimizations called list omitting (LSF_LO) and partial scoring (LSF_PS) that make full use of query term importance. Finally, experimental results with the TREC GOV2 collection show that our new index traversal approaches reduce the query latency by almost 27% over the WAND baseline and produce slightly better results compared with the MaxScore baseline, while returning the same results as exhaustive evaluation.



Key wordsInverted index      Index traversal      Query latency      Largest scores first (LSF) retrieval      Dynamic pruning     
Received: 06 June 2015      Published: 05 January 2016
CLC:  TP393  
Corresponding Authors: Kun JIANG     E-mail: jk_365@126.com
Cite this article:

Kun JIANG, Yue-xiang YANG. Efficient dynamic pruning on largest scores first (LSF) retrieval. Front. Inform. Technol. Electron. Eng., 2016, 17(1): 1-14.

URL:

http://www.zjujournals.com/xueshu/fitee/10.1631/FITEE.1500190     OR     http://www.zjujournals.com/xueshu/fitee/Y2016/V17/I1/1


Efficient dynamic pruning on largest scores first (LSF) retrieval

Inverted index traversal techniques have been studied in addressing the query processing performance challenges of web search engines, but still leave much room for improvement. In this paper, we focus on the inverted index traversal on document-sorted indexes and the optimization technique called dynamic pruning, which can efficiently reduce the hardware computational resources required. We propose another novel exhaustive index traversal scheme called largest scores first (LSF) retrieval, in which the candidates are first selected in the posting list of important query terms with the largest upper bound scores and then fully scored with the contribution of the remaining query terms. The scheme can effectively reduce the memory consumption of existing term-at-a-time (TAAT) and the candidate selection cost of existing document-at-a-time (DAAT) retrieval at the expense of revisiting the posting lists of the remaining query terms. Preliminary analysis and implementation show comparable performance between LSF and the two well-known baselines. To further reduce the number of postings that need to be revisited, we present efficient rank safe dynamic pruning techniques based on LSF, including two important optimizations called list omitting (LSF_LO) and partial scoring (LSF_PS) that make full use of query term importance. Finally, experimental results with the TREC GOV2 collection show that our new index traversal approaches reduce the query latency by almost 27% over the WAND baseline and produce slightly better results compared with the MaxScore baseline, while returning the same results as exhaustive evaluation.


关键词: Inverted index,  Index traversal,  Query latency,  Largest scores first (LSF) retrieval,  Dynamic pruning 
Fig. 1 An example showing the index traversal procedure of TAAT with two posting lists of the query terms 'piano' and 'music'. The solid line separates different iterations of the procedure, and the scoring operation scans from left to right in each iteration to accumulate the partial scores of all the term frequencies ft, d. The full result scores of the documents can be obtained when both 'piano' and 'music' are considered. Reprinted from Jiang and Yang (2015), Copyright 2015, with permission from Springer
Fig. 2 An example showing the index traversal procedure of DAAT with two posting lists of the query terms 'piano' and 'music'. The solid line separates different iterations of the procedure from left to right, and the scoring operation scans from top to bottom in each iteration to sum up all of the term frequencies ft, d. The full result score of a document can be obtained when both 'piano' and 'music' are considered in parallel. Reprinted from Jiang and Yang (2015), Copyright 2015, with permission from Springer
Fig. 3 An example showing the index traversal procedure of LSF with the query terms 'piano' and 'music'. The dashed line separates different candidate document scorings of a given posting list from left to right to obtain partial scores. The solid line separates different query term iterations to scan another posting list for candidate documents. The cross denotes that the posting has been considered in previous posting lists
Algorithm 1 Exhaustive OR_LSF retrieval
Algorithm 2 Dynamic pruning on LSF retrieval
Algorithm Average query latency (ms)
k=10 k=100 k=1000
OR_TAAT1290.11250.81275.1
OR_DAAT231.6232.6237.9
OR_LSF279.6281.2284.8
AND_TAAT238.7238.3238.4
AND_DAAT24.124.124.3
AND_LSF34.234.334.5
Table 1 Average query latency of the exhaustive index traversal techniques with different numbers of results
Algorithm Nhi Nsc (×103) Nde (×l03) Ncd
OR_TAAT10.02591.62591.6284.8
OR_DAAT119.52591.62591.6284.8
OR_LSF83.42591.63256.1650.0
AND_TAAT10.0207.62591.6284.8
AND_DAAT50.576.8227.7250.0
AND_LSF 50.5 76.8 227.7 250.0
Table 2  The average number of processed elements for different index traversal techniques with the number of results k = 10
Algorithm Average query latency (ms)
Average 2 3 4 5 >5
WAND 47.3 30.6 44.6 58.4 68.3 103.7
MaxScore 35.6 26.7 33.3 38.8 47.7 63.1
LSF_LO 61.1 38.9 56.9 70.2 89.9 140.3
LSF_PS 34.4 28.3 33.2 37.8 45.9 58.6
Table 3 Average query latency of different dynamic pruning techniques for different numbers of query terms
Algorithm Nhi Nsc (×l03) Nde (×l03) Ncd
WAND 119.4 114.6 379.5 280.0
MaxScore 119.4 215.4 238.9 223.3
LSF_LO 83.4 258.6 393.9 324.0
LSF_PS 83.4 188.5 219.0 240.6
Table 4 The average number of processed elements of different dynamic pruning techniques with the number of results k = 10
Algorithm Average query latency (ms)
k=10 k=50 k=100 k=500 k=1000
WAND 47.3 54.4 62.4 76.6 86.3
MaxScore 35.6 45.5 51.5 66.6 79.0
LSF_LO 61.1 67.3 70.4 86.9 94.4
LSF_PS 34.4 41.5 45.1 63.7 71.1
Table 5 Average query latency of different dynamic pruning techniques with different numbers of results
Fig. 4 Average query latency of different dynamic pruning techniques with various physical block sizes (k = 10)
1 Anh VN, Moffat A . Simplified similarity scoring using term ranks. 2005, Proc.28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 226-233. doi: 10.1145/1076034.1076075
2 Anh VN, Moffat A . Pruned query evaluation using pre-computed impacts. 2006, Proc. 29th Annual ACM SIGIR Conf. on Research and Development in Information Retrieval. 372-379. doi: 10.1145/1148170.1148235
3 Anh VN, Moffat A, . et al. . Index compression using 64-bit words. Softw. Pract. Exper. 2010, 40 (2): 131-147. doi: 10.1002/spe.948
4 Badue C, Ribeiro-Neto B, Baeza-Yates R, et al. . Distributed query processing using partitioned inverted files. 2001, Proc. 8th Int. Symp. on String Processing and Information Retrieval. 10-20. doi: 10.1109/SPIRE.2001.989733
5 Broder AZ, Carmel D, Herscovici M, . et al. . Efficient query evaluation using a two-level retrieval process. 2003, Proc. 12th Int. Conf. on Information and Knowledge Management. 426-434. doi: 10.1145/956863.956944
6 Buckley C, Lewit AF . Optimization of inverted vector searches. 1985, Proc. 8th Annual Int.ACM SIGIR Conf. on Research and Development in Information Retrieval. 97-110. doi: 10.1145/253495.253515
7 Büttcher S, Clarke CLA . Index compression is good, especially for random access. 2007, Proc. 16th ACM Conf. on Information and Knowledge Management. 761-770. doi: 10.1145/1321440.1321546
8 Büttcher S, Clarke CLA, Cormack GV . Information Retrieval: Implementing and Evaluating Search Engines 2010, USAThe MIT Press
9 Chakrabarti K, Chaudhuri S, Ganti V . Intervalbased pruning for top-k processing over compressed lists. 2011, Proc. 27th Int. Conf. on Data Engineering. 709-720. doi: 10.1109/ICDE.2011.5767855
10 Croft B, Metzler D, Strohman T . Search Engines: Information Retrieval in Practice 2010, USAAddison Wesley
11 Dean J . Challenges in building large-scale information retrieval systems: invited talk. 2009, Proc. 2nd ACM Int. Conf. on Web Search and Data Mining. 1 doi: 10.1145/1498759.1498761
12 Delbru R, Campinas S, Tummarello G . Searching web data: an entity retrieval and high-performance indexing model. Web Semant. Sci. Serv. Agents World Wide Web2012, 10: 33-58. doi: 10.1016/j.websem.2011.04.004
13 Dimopoulos C, Nepomnyachiy S, Suel T . Optimizing top-k document retrieval strategies for block-max indexes. 2013, Proc. 6th ACM Int. Conf. on Web Search and Data Mining. 113-122. doi: 10.1145/2433396.2433412
14 Ding S, Suel T . Faster top-k document retrieval using block-max indexes. 2011, Proc. 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 993-1002. doi: 10.1145/2009916.2010048
15 Fontoura M, Josifovski V, Liu JH, et al. . Evaluation strategies for top-k queries over memory-resident inverted indexes. Proc. VLDB Endow, 2011, 1213-1224.
16 Jiang K, Yang YX . Exhaustive hybrid posting lists traversing technique. 2015, Proc. 5th Int. Conf. on Intelligence Science and Big Data Engineering. 1-11. doi: 10.1007/978-3-319-23862-3_1
17 Jiang K, Song XS, Yang YX . Performance evaluation of inverted index traversal techniques. 2014, Proc. 17th Int. Conf. on Computational Science and Engineering. 1715-1720. doi: 10.1109/CSE.2014.315
18 Jonassen S, Bratsberg SE . Efficient compressed inverted index skipping for disjunctive text-queries. 2011, Proc. 33rd European Conf. on Advances in Information Retrieval. 530-542. doi: 10.1007/978-3-642-20161-5_53
19 Lacour P, Macdonald C, Ounis I . Efficiency comparison of document matching techniques. 2008, Proc. European Conf. on Information Retrieval. 37-46.
20 Lester N, Moffat A, Webber W, et al. . Spacelimited ranked query evaluation using adaptive pruning. 2005, Proc. 6th Int. Conf. on Web Information Systems Engineering. 470-477. doi: 10.1007/11581062_37
21 Macdonald C, Ounis I, Tonellotto N . Upperbound approximations for dynamic pruning. ACM Trans. Inform. Syst. 2011, 29 (4): 17.1-17.28. doi: 10.1145/2037661.2037662
22 Manning CD, Raghavan P, Schütze H . Introduction to Information Retrieval 2008, USA Cambridge University Press, Cambridge
23 Melink S, Raghavan S, Yang B, et al. . Building a distributed full-text index for the Web. 2001, Proc. 10th Int. Conf. on World Wide Web. 396-406. doi: 10.1145/371920.372095
24 Moffat A, Zobel J . Self-indexing inverted files for fast text retrieval. ACM Trans. Inform. Syst. 1996, 14 (4): 349-379. doi: 10.1145/237496.237497
25 Ounis I, Amati G, Plachouras V, et al. . Terrier: a high performance and scalable information retrieval platform. 2006, Proc. OSIR Workshop. 18-25.
26 Puppin D, Silvestri F, Perego R , et al. . Tuning the capacity of search engines: load-driven routing and incremental caching to reduce and balance the load. ACM Trans. Inform. Syst. 2010, 28(2): 5.1-5.36. doi: 10.1145/1740592.1740593
27 Silvestri F, Venturini R . VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming. 2010, Proc. 19th ACM Int. Conf. on Information and Knowledge Management. 1219-1228. doi: 10.1145/1871437.1871592
28 Strohman T, Croft WB . Efficient document retrieval in main memory. 2007, Proc. 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 175-182. doi: 10.1145/1277741.1277774
29 Strohman T, Turtle H, Croft W.B . Optimization strategies for complex queries. 2005, Proc. 28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 219-225. doi: 10.1145/1076034.1076074
30 Turtle H, Flood J . Query evaluation: strategies and optimizations. Inform. Process. Manag. 1995, 31(6): 831-850. doi: 10.1016/0306-4573(95)00020-H
31 Wang LD, Lin J, Metzler D . A cascade ranking model for efficient ranked retrieval. 2011, Proc. 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 105-114. doi: 10.1145/2009916.2009934
32 Zobel J, Moffat A . Inverted files for text search engines. ACM Comput. Surv 2006, 38(2): 6.1-6.56. doi: 10.1145/1132956.1132959
33 Zukowski M, Heman S, Nes N , et al. . Super-scalar RAM-CPU cache compression. 2006, Proc. 22nd Int. Conf. on Data Engineering. 59 doi: 10.1109/ICDE.2006.150
[1] Lie-fu Ai, Jun-qing Yu, Yun-feng He, Tao Guan. High-dimensional indexing technologies for large scale content-based image retrieval: a review[J]. Front. Inform. Technol. Electron. Eng., 2013, 14(7): 505-520.