Frontiers of Information Technology & Electronic Engineering, 2016, 17(1): 1-14. doi: 10.1631/FITEE.1500190

Original article

Efficient dynamic pruning on largest scores first (LSF) retrieval

JIANG Kun†, YANG Yue-xiang

College of Computer, National University of Defense Technology, Changsha 410073, China

Corresponding author: †E-mail: jk_365@126.com


Received: 2015-06-06   Accepted: 2015-10-14

Abstract

Inverted index traversal techniques have been studied to address the query processing performance challenges of web search engines, but they still leave much room for improvement. In this paper, we focus on inverted index traversal on document-sorted indexes and the optimization technique called dynamic pruning, which can efficiently reduce the computational resources required. We propose a novel exhaustive index traversal scheme called largest scores first (LSF) retrieval, in which candidates are first selected from the posting lists of important query terms with the largest upper bound scores and then fully scored with the contributions of the remaining query terms. The scheme can effectively reduce the memory consumption of existing term-at-a-time (TAAT) retrieval and the candidate selection cost of existing document-at-a-time (DAAT) retrieval, at the expense of revisiting the posting lists of the remaining query terms. Preliminary analysis and implementation show comparable performance between LSF and the two well-known baselines. To further reduce the number of postings that need to be revisited, we present efficient rank safe dynamic pruning techniques based on LSF, including two important optimizations called list omitting (LSF_LO) and partial scoring (LSF_PS) that make full use of query term importance. Finally, experimental results with the TREC GOV2 collection show that our new index traversal approaches reduce query latency by almost 27% over the WAND baseline and perform slightly better than the MaxScore baseline, while returning the same results as the exhaustive evaluation.

Keywords: Inverted index ; Index traversal ; Query latency ; Largest scores first (LSF) retrieval ; Dynamic pruning



1 Introduction

One major problem in the query processing of search engines is that the length of the posting list can easily grow to hundreds of MBs or even GBs for common terms (Dean, 2009). Given that search engines need to answer queries within fractions of a second, naively traversing the basic index structure, which could take hundreds of milliseconds, is not acceptable. This problem has long been recognized by researchers and has motivated much work on optimization techniques, including inverted index partitioning, distribution, index compression, and index traversal (Turtle and Flood, 1995; Badue et al., 2001; Melink et al., 2001; Anh and Moffat, 2010; Puppin et al., 2010). In this paper, we focus on exhaustive index traversal techniques, which naively access all the postings that are relevant to the query terms, and optimization techniques called dynamic pruning (Moffat and Zobel, 1996; Chakrabarti et al., 2011), which return the best k results without an exhaustive traversal of the entire relevant posting lists.

Ingenious index traversal techniques can significantly quicken query processing and reduce the hardware costs of search engines (Croft et al., 2010). The schemes for traversing posting lists within a document-sorted index for a query fall into two main classes: term-at-a-time (TAAT) and document-at-a-time (DAAT) (Turtle and Flood, 1995). The main disadvantage of TAAT techniques is that they require large amounts of memory to store partial scores when traversing each posting list, especially for large index sizes. The iterative candidate selection of DAAT techniques, on the other hand, incurs high costs from the random access of all the posting lists, which requires jumping back and forth between different posting lists. To address the shortcomings of the existing traversal techniques, we propose a novel exhaustive index traversal technique called largest scores first (LSF) retrieval, which reduces the memory consumption relative to TAAT and avoids the costly candidate selection of DAAT, at the expense of revisiting some of the remaining posting lists. A prototype of LSF retrieval was proposed in our previous study (Jiang and Yang, 2015), but without a thorough analysis of its computational complexity. In this paper, we additionally describe two novel rank safe dynamic pruning heuristics on LSF, list omitting (LSF_LO) and partial scoring (LSF_PS), to reduce the number of postings that need to be processed. Experimental results show that our LSF retrieval is faster than TAAT retrieval and comparable to DAAT retrieval, and also has potential advantages when combined with dynamic pruning techniques. Further experiments show that our LSF-based pruning techniques reduce disjunctive query processing time by almost 27% over the WAND baseline and perform slightly better than the MaxScore baseline, while returning the same results as the exhaustive evaluation. With more query terms or more returned results, the LSF_LO and LSF_PS schemes show larger improvements over the two baselines.

2 Related work

2.1 Inverted index

The inverted index plays a vital role in the efficient processing of ranked and Boolean queries (Zobel and Moffat, 2006). Given a collection of N documents, we assume that each document is identified by a unique document identifier (docid) between 0 and N-1. An inverted index can be seen as an array of posting lists, where each entry of the array corresponds to a different term or word in the collection. The set of terms is called the lexicon of the collection. For each term t in the lexicon, the index contains a posting list $I_t$ consisting of a number of postings describing all places where term t occurs in the collection. More precisely, each posting in $I_t$ contains the docid d and the number of occurrences of t in the document (called term frequency, denoted as $f_{t,d}$), and possibly other information about the locations of the occurrences and their contexts. We assume postings contain docids and frequencies, and do not consider other data such as positions or contexts in this paper; thus, each posting is of the form (d : $f_{t,d}$), denoting that term t occurs $f_{t,d}$ times in document d. We believe that our techniques can also be applied to indexes storing such additional information, which we leave for future work.
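To make this layout concrete, below is a minimal sketch in Java (the language of our implementation; all class and method names here are illustrative rather than taken from any published code) of an in-memory, docid-sorted posting list:

```java
// A minimal in-memory model of an inverted index (illustrative only).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Posting {
    final int docid;  // document identifier d
    final int freq;   // term frequency f_{t,d}
    Posting(int docid, int freq) { this.docid = docid; this.freq = freq; }
}

final class InvertedIndex {
    // lexicon: maps each term t to its posting list I_t, kept in ascending docid order
    private final Map<String, List<Posting>> lexicon = new HashMap<>();

    void add(String term, int docid, int freq) {
        lexicon.computeIfAbsent(term, t -> new ArrayList<>()).add(new Posting(docid, freq));
    }

    List<Posting> postings(String term) {
        return lexicon.getOrDefault(term, new ArrayList<>());
    }
}
```

A production index would store the lists compressed on disk rather than as Java objects; this sketch only fixes the (d : $f_{t,d}$) vocabulary used in the rest of the paper.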

The lexicon is comparatively small in most cases. However, the posting lists may consist of many millions of postings, which could be approximately linear with the size of the collection (Manning et al., 2008). The posting lists can be organized in either document-sorted indexes or impact-sorted indexes. The document-sorted index is the standard approach for basic exhaustive query processing, where the postings in each posting list are sorted by ascending docids. The impact-sorted index (commonly the impact is computed by term frequency) is organized in such a way that postings in each list are sorted by their impact. A problem with impact-sorted indexes is that compression could suffer as d-gaps in the posting lists may be very large (Ding and Suel, 2011; Dimopoulos et al., 2013). In this study, we focus on the document-sorted index. To allow faster access and limit the amount of memory needed, search engines use various compression techniques that significantly reduce the size of the posting lists within document-sorted indexes (Zukowski et al., 2006; Büttcher and Clarke, 2007; Silvestri and Venturini, 2010).

Researchers have introduced additional synchronization structures over chunks of the posting lists, accelerating query processing by skipping over compressed chunks (Moffat and Zobel, 1996; Strohman and Croft, 2007; Jonassen and Bratsberg, 2011). This facilitates searching for a particular docid within a posting list by skipping over a large number of compressed postings. Either additional skipping pointers are embedded in the posting list, or a skipping table describing the front of a few selected chunks is stored together with the lexicon, both describing a potential skip when decoding. This allows entire chunks that would be processed by an exhaustive evaluation to be skipped when searching for a given docid.

2.2 Index traversal

Most search engines first use a simple ranking function to compute a score for each document that passes Boolean query filters, i.e., conjunctive (AND) and disjunctive (OR) query filters. Conjunctive queries impose the requirement that only documents matching all of the query terms can be returned as results. Disjunctive queries relax this requirement such that documents matched by any of the query terms may be reported as results. Since a large number of documents may pass the Boolean query filter, users are commonly interested in the top-k results. Thus, search engines need only return the k results with the highest similarity that match the query, and send these k results to a complex ranking process with hundreds of features. In this study, we use a bag-of-words similarity ranking model for the first simple ranking phase, in which the final score can be decomposed into a sum or simple combination of per-term scores. Thus, the scoring function S(q, d) can be defined as $S(q, d)= \sum\nolimits_{t \in q} {s(t, d)} $, where q is the set of all terms in the query and s(t, d) is the similarity score contribution of term t to document d.

As mentioned, the schemes to exhaustively traverse posting lists within a document-sorted index for a query fall into mainly two categories: term-centric TAAT and document-centric DAAT (Turtle and Flood, 1995). With TAAT techniques, the query results are obtained by sequentially traversing one posting list at a time. As documents occur in different posting lists, intermediate scores must be kept in accumulators until all of the occurrences of a document have been considered. A simple example of TAAT traversal is shown in Fig. 1. With DAAT techniques, the posting lists that relate to a query are processed in parallel. As a consequence, the final score of a document can be obtained and inserted into a heap structure before moving to the next document. The system stores the top-k results considered thus far, together with the threshold of the result heap; if the score of a new candidate surpasses the threshold, it is inserted into the result heap. At the end of the procedure, the k results in the heap are returned to the user, so the memory requirements are fairly small. A simple example of DAAT traversal is shown in Fig. 2.


Fig. 1   An example showing the index traversal procedure of TAAT with two posting lists of the query terms 'piano' and 'music'. The solid line separates different iterations of the procedure, and the scoring operation scans from left to right in each iteration to accumulate the partial scores of all the term frequencies ft, d. The full result scores of the documents can be obtained when both 'piano' and 'music' are considered. Reprinted from Jiang and Yang (2015), Copyright 2015, with permission from Springer



Fig. 2   An example showing the index traversal procedure of DAAT with two posting lists of the query terms 'piano' and 'music'. The solid line separates different iterations of the procedure from left to right, and the scoring operation scans from top to bottom in each iteration to sum up all of the term frequencies ft, d. The full result score of a document can be obtained when both 'piano' and 'music' are considered in parallel. Reprinted from Jiang and Yang (2015), Copyright 2015, with permission from Springer


The DAAT and TAAT techniques can be instantiated as four algorithms with different query filters, i.e., AND_DAAT, OR_DAAT, AND_TAAT, and OR_TAAT. An evaluation of these index traversal techniques can be found in our previous work (Jiang et al., 2014). The result sets returned by conjunctive queries are potentially smaller than those returned by disjunctive queries. This can be a problem, especially when one of the terms is missing or mistyped. Moreover, one of the main challenges associated with the two query filters is that disjunctive queries tend to be significantly more expensive than conjunctive queries, as they have to evaluate many more documents; in conjunctive processing, the shortest posting list can be used to skip efficiently through the longer posting lists and thus reduce the amount of data to be processed. For this reason, the optimization of disjunctive queries has become an important research topic in index traversal (Büttcher et al., 2010; Jonassen and Bratsberg, 2011).
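As a reference point for the rest of the paper, the following is a compact Java sketch of disjunctive DAAT traversal (illustrative only, reusing the Posting type from the sketch above; it is not the Terrier code used in our experiments):

```java
// Disjunctive (OR) DAAT: the candidate is always the smallest current docid across
// all list cursors; it is fully scored (here, by summed frequency) before moving on.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class Cursor {
    final List<Posting> list;
    int pos = 0;
    Cursor(List<Posting> list) { this.list = list; }
    boolean done() { return pos >= list.size(); }
    Posting cur() { return list.get(pos); }
}

final class OrDaat {
    static Map<Integer, Integer> topK(List<List<Posting>> lists, int k) {
        List<Cursor> cursors = new ArrayList<>();
        for (List<Posting> l : lists) if (!l.isEmpty()) cursors.add(new Cursor(l));
        // min-heap on score keeps the best k documents seen so far
        PriorityQueue<int[]> heap = new PriorityQueue<int[]>((a, b) -> Integer.compare(a[1], b[1]));
        while (!cursors.isEmpty()) {
            int d = Integer.MAX_VALUE;
            for (Cursor c : cursors) d = Math.min(d, c.cur().docid);  // candidate selection
            int score = 0;
            Iterator<Cursor> it = cursors.iterator();
            while (it.hasNext()) {                                    // sum contributions at d
                Cursor c = it.next();
                if (c.cur().docid == d) { score += c.cur().freq; c.pos++; }
                if (c.done()) it.remove();
            }
            heap.offer(new int[]{d, score});
            if (heap.size() > k) heap.poll();
        }
        Map<Integer, Integer> out = new HashMap<>();
        for (int[] e : heap) out.put(e[0], e[1]);
        return out;
    }
}
```

The per-candidate minimum computation over all cursors is exactly the candidate selection cost that LSF is designed to avoid.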

Impact-sorted indexes can be processed either term-at-a-time or score-at-a-time (Anh and Moffat, 2005), in which all of the posting lists are open simultaneously and merged in decreasing order of contribution, so that postings that make big contributions to document scores are processed first. The score-at-a-time scheme has been proposed mainly for dynamic pruning, as it can obtain the top scored results as soon as possible (Strohman and Croft, 2007). As noted above, compression of impact-sorted indexes could suffer because d-gaps in the posting lists may be very large, whereas document-sorted indexes tend to be much smaller under different compression algorithms (Ding and Suel, 2011; Dimopoulos et al., 2013). Dynamic pruning techniques on impact-sorted indexes are useful when the number of distinct impact layers is small, or when frequencies are used instead of impacts (Ding and Suel, 2011).

2.3 Dynamic pruning

Dynamic pruning techniques return the top-k scored candidates while avoiding the full evaluation of all documents that satisfy the disjunctive or conjunctive condition (Ding and Suel, 2011). A min-heap (usually used in DAAT) or a set of accumulators (usually used in TAAT) stores the top-k documents during evaluation, and the maximum achievable score of each posting list is normally estimated and stored together with its lexicon term. Pruning decisions are then made by comparing the maximum score of a posting list with the threshold of the result heap or accumulator table. We define result safe dynamic pruning techniques as those that return the same top-k results as the exhaustive evaluation; rank safe techniques are more strictly defined as those that return all of the results in the same order as the exhaustive evaluation.

One of the earliest optimizations in TAAT processing comes from Buckley and Lewit (1985). In this approach, terms are sorted and processed from the least to the most frequent, so that the important terms come first to obtain a larger accumulated score as early as possible. After processing a query term, it may be possible to show that considering any more terms would not much affect the final results; in that case, one or more lists for the remaining query terms are completely omitted. As the document at rank k+1 cannot surpass the score of the document at rank k, the processing terminates for a set of accumulators with top-k partial scores. This approach belongs to the result safe scheme, in which only the identities of the top-k results are guaranteed.

The TAAT MaxScore proposed by Turtle and Flood (1995) is somewhat similar to the method presented by Buckley and Lewit (1985). One improvement is a rank safe technique in which terms are evaluated until the top-k documents are guaranteed to be in the correct order. To ensure rank safe results, the remaining posting lists are still accessed to add the scores of postings that already exist in the accumulator tables. Another improvement is that the approach removes the accumulators that cannot be part of the top-k results after processing each term.

Moffat and Zobel (1996) presented quit and continue optimization for TAAT processing. The quit heuristic dynamically adds accumulators until the number of accumulators meets the fixed threshold. At this point, documents are returned to the user with the partial scores in the accumulators. The continue heuristic continues evaluation when the accumulator threshold is reached but considers only those documents that already have accumulators allocated. Lester et al. (2005) further introduced an efficient space-limited adaption of quit and continue optimization, which provides a trade-off between the number of accumulators and the result quality.

The DAAT MaxScore (Turtle and Flood, 1995; Lacour et al., 2008; Fontoura et al., 2011) takes advantage of only the partial scoring scheme for optimization. When a document is scored, we add to the partial score of the document the sum of the accumulated maximum scores of the remaining terms to be scored. If this sum is less than the threshold of the min-heap, then we do not calculate the remaining term scores of the document, as the document cannot be in the top-k documents. The disadvantage of this scheme, as commonly applied in DAAT index traversal, is that the candidate is not selected from the posting list of the most important term, but is always the lowest current docid across all the posting lists. For a candidate that will not be inserted into the result heap, if the terms with larger maximum scores are evaluated first, the partial score plus the maximum score of the remaining terms tends to drop below the threshold much earlier, which avoids accessing the posting lists with smaller contributions.

The description of MaxScore by Strohman et al. (2005) differs from the original one in that term importance is considered. The observation is that, as the number of results in the heap increases, the threshold becomes larger and larger. At some point, no document can make it into the top-k results if it is selected only from query terms whose maximum scores are lower than the threshold. Thus, candidates can be selected only from important terms with larger maximum scores. More recently, Jonassen and Bratsberg (2011) presented a unified description of the above two versions with efficient skipping operations. In the latest MaxScore, posting lists are sorted from top to bottom by their maximum scores and divided into essential and non-essential lists, according to whether the accumulated maximum scores sum to less than the threshold. The essential lists with larger maximum scores are possibly more important, and candidates are selected and scored first in these lists. Partial scoring and skipping are applied to the non-essential lists with smaller contributions.

Broder et al. (2003) proposed an ingenious posting list sorting and skipping scheme named WAND, based on DAAT. The candidate is selected by summing the maximum scores of the query terms in ascending order of their current list pointer docids, until the sum becomes larger than the threshold of the result heap; the current docid of that term is selected as the candidate, called the pivot. The posting lists whose docids are smaller than the pivot are then processed in AND mode, with skipping and scoring. Analysis and experimental results by Dimopoulos et al. (2013) show that WAND is not as efficient as MaxScore, due to the frequent sorting of posting lists and the lack of partial scoring.
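The pivot selection at the heart of WAND can be sketched as follows (an illustrative Java fragment under our reading of Broder et al. (2003), reusing the Cursor type from the DAAT sketch above; 'upperBound' stands for the precomputed maximum score of a term's list and 'theta' for the heap threshold):

```java
// WAND pivot selection: sort cursors by current docid and accumulate upper bounds
// until they could beat theta; the list where this happens supplies the pivot docid.
import java.util.Comparator;
import java.util.List;

final class Wand {
    record Entry(Cursor cursor, double upperBound) {}

    static int findPivot(List<Entry> entries, double theta) {
        entries.sort(Comparator.comparingInt((Entry e) -> e.cursor().cur().docid));
        double acc = 0.0;
        for (Entry e : entries) {
            acc += e.upperBound();
            if (acc > theta) {
                // no docid smaller than this one can score above theta, so lists with
                // smaller current docids can be skipped forward to the pivot in AND mode
                return e.cursor().cur().docid;
            }
        }
        return Integer.MAX_VALUE;  // even all upper bounds together cannot beat theta: stop
    }
}
```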

Anh and Moffat (2005; 2006) described efficient dynamic pruning methods on an impact-sorted index with score-at-a-time traversal. The methods trim the accumulator table each time a new impact score is encountered, once the top-k results have been determined. Strohman and Croft (2007) further improved the pruning technique with continuous accumulator pruning and a combination of skipping and skipping length. The candidate documents are selected by the smallest docid in each segment of the query terms with equal term importance. The idea of continuous trimming can be seen as immediate partial scoring over different impact layers.

3 Largest scores first traversal

In this section, we present a novel exhaustive index traversal technique called largest scores first (LSF) and give a detailed implementation description of its main methods.

3.1 Largest scores first

We describe a novel index traversal technique called LSF retrieval, which makes full use of query term importance. LSF first sorts the posting lists of the query terms in ascending order of document frequency and iterates over them, so that the shortest list is evaluated first; the postings are accessed in an order that is neither strictly term-centric nor strictly document-centric. Not surprisingly, the posting list with the lowest document frequency is almost always the one with the largest upper bound score under common ranking functions, such as BM25. This means that the usual estimates of term importance (e.g., $\mathrm{idf}_t=\log(N/N_t)$, where idf represents the inverse document frequency) decrease for lists encountered later in the evaluation. Iterating from the most significant term to the least makes the most promising documents appear early for scoring. Thus, the threshold of the result heap rises quickly, which prevents lower scored documents from being inserted into the heap. We choose the query term with the shortest posting list as the candidate selection term (CST) and the remaining terms as score adding terms (SATs). Next, we take each posting of the candidate selection term as a candidate document and access the posting lists of all the query terms simultaneously to score it. For every candidate selection term except the first, we must check whether the candidate has already been considered under a previous candidate selection term. Finally, we insert the result document and its full score into the result heap. Every time we reach the end of the candidate selection term's list, we remove the term from the query term set and, if the set is not empty, select the most significant remaining term as the new candidate selection term and the rest as SATs. An example showing the procedure of LSF is depicted in Fig. 3, in which, for simplicity, the score of each docid is obtained by adding all of the term frequencies in the postings with that docid.


Fig. 3   An example showing the index traversal procedure of LSF with the query terms 'piano' and 'music'. The dashed line separates different candidate document scorings of a given posting list from left to right to obtain partial scores. The solid line separates different query term iterations to scan another posting list for candidate documents. The cross denotes that the posting has been considered in previous posting lists


3.2 Algorithm and analysis

Given the inverted index and the result heap, LSF returns the k documents that have the highest scores according to the scoring function S(q, d). The details of the technique are depicted in Algorithm 1. As shown, LSF first sorts the posting lists by ascending length and then iterates over the posting list of each query term. For a given posting list, it selects a candidate document d and checks whether d has been considered before (lines 5-7). We use a bitmap data structure to store the documents that have already been considered, which reduces the revisiting of postings. Then every scoring posting list first skips to the candidate document d (line 12); if the candidate does not exist in the list, the cursor points to the first document larger than d. If it points to the candidate, its contribution is added to the accumulated score using a simple ranking function (lines 13-15). When all of the posting lists have been considered for d, we insert the candidate d into the result heap if its score is larger than the threshold (line 17). When the posting list of the candidate selection term reaches the end, the scoring posting lists are reset so that their cursors point to their first documents. Finally, we return the top k scoring documents in the heap (line 24).


Algorithm 1   Exhaustive OR_LSF retrieval
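Since the pseudo code itself is not reproduced here, the following Java sketch outlines exhaustive OR_LSF as described above (illustrative only; it follows the text rather than the published pseudo code, and reuses the Posting and Cursor types from the earlier sketches):

```java
// Exhaustive OR_LSF: the shortest (most important) list supplies candidates; the
// remaining lists (SATs) are skipped forward to complete each candidate's score.
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class OrLsf {
    static Map<Integer, Integer> topK(List<List<Posting>> lists, int k) {
        lists.sort(Comparator.comparingInt(List::size));      // ascending length
        PriorityQueue<int[]> heap = new PriorityQueue<int[]>((a, b) -> Integer.compare(a[1], b[1]));
        BitSet seen = new BitSet();                           // bitmap of considered docids
        for (int i = 0; i < lists.size(); i++) {              // i-th list acts as the CST
            List<Cursor> sats = new ArrayList<>();            // fresh cursors model reset()
            for (int j = i + 1; j < lists.size(); j++) sats.add(new Cursor(lists.get(j)));
            for (Posting p : lists.get(i)) {
                int d = p.docid;
                if (seen.get(d)) continue;                    // duplicate check (lines 5-7)
                seen.set(d);
                int score = p.freq;                           // contribution of the CST itself
                for (Cursor c : sats) {                       // skip each SAT to d (line 12)
                    while (!c.done() && c.cur().docid < d) c.pos++;  // linear skipto, for brevity
                    if (!c.done() && c.cur().docid == d) score += c.cur().freq;
                }
                heap.offer(new int[]{d, score});              // heap insert (line 17)
                if (heap.size() > k) heap.poll();             // keep only the k best so far
            }
        }
        Map<Integer, Integer> out = new HashMap<>();
        for (int[] e : heap) out.put(e[0], e[1]);
        return out;
    }
}
```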


For n query terms, let $N_{t_i}$ denote the number of documents containing term $t_i$. The first pass of the query costs $\Theta(N_{t_1} \cdot n + N_{t_1} \cdot \log_2 k)$, where $N_{t_1} \cdot n$ corresponds to the number of scoring operations in line 14 and $N_{t_1} \cdot \log_2 k$ to the number of heap insertions in line 17. For the ith query pass, corresponding to the ith loop over line 14, the cost is $\Theta(N_{t_i} \cdot (n-(i-1)) + N_{t_i} \cdot \log_2 k)$, since, depending on how often the query terms co-occur in a document, a posting can be scored n-(i-1) times in the worst case (line 14). The time complexity of the duplicate checking for the ith (i>1) pass is $\Theta(\sum\nolimits_{k=1}^{i-1} N_{t_k})$. Thus, over all n passes, the overall time complexity of the algorithm is

$\Theta \left( \sum\limits_{i=1}^{n} N_{t_i}(n-(i-1)) + N_q \log_2 k + \sum\limits_{i=1}^{n} \sum\limits_{k=1}^{i-1} N_{t_k} \right), $

where $N_q=N_{t_1}+N_{t_2}+\cdots+N_{t_n}$. Since the inverted index iterators are sorted by ascending list length, we can deduce that $\Theta(\sum\nolimits_{k = 1}^{i - 1} {{N_{{t_k}}}})< \Theta({N_{{t_i}}} \cdot(i - 1))$. Then we find that

$\begin{aligned} &\Theta \left( \sum\limits_{i=1}^{n} N_{t_i}(n-(i-1)) + N_q \log_2 k + \sum\limits_{i=1}^{n} \sum\limits_{k=1}^{i-1} N_{t_k} \right)\\ &< \Theta \left( \sum\limits_{i=1}^{n} N_{t_i}(n-(i-1)) + N_q \log_2 k + \sum\limits_{i=1}^{n} N_{t_i}(i-1) \right)\\ &= \Theta \left( \sum\limits_{i=1}^{n} N_{t_i}\, n + N_q \log_2 k \right) = \Theta(N_q n + N_q \log_2 k). \end{aligned}$

It was reported in Büttcher et al. (2010) that the time complexities of DAAT and TAAT with normal implementations are both $\Theta(N_q n+N_q \log_2 k)$. Although the complexity of the three techniques is the same, the run time cost is quite different. The advantage of LSF is that it avoids both the large memory footprint of TAAT and the computation of the smallest current docid during candidate selection in DAAT. The disadvantages of LSF processing are the revisiting of the remaining posting lists, which can be reduced by dynamic pruning techniques, and the time consuming duplicate checking, which stems from the large number of if-else branch predictions needed to confirm whether postings have been visited and the bit operations needed to set and query positions in the bitmap.

The memory consumption of LSF lies mainly in the priority queue with k results and the bitmap data structure. This is almost the same as DAAT traversal, which needs only a priority queue with k results. In contrast, TAAT traversal keeps all the intermediate results in memory, which can be linear in the number of documents in the collection. Thus, based on our preliminary analysis and the descriptions in Lacour et al. (2008) and Fontoura et al. (2011), we omit exhaustive TAAT and its dynamic pruning techniques from consideration in the following sections.

3.3 Posting list iterator

We now describe an efficient posting list iterator that operates directly on the physical index data and provides methods for various index traversal techniques. The iterator maintains mainly two important attributes: 'currentptr', the address of the current pointer within the posting list, and 'currentdid', the docid of the current posting. There are also other attributes that can be used to score the document, but they are not needed when merely traversing the posting list. Regardless of the underlying index structure, we define five basic methods on the posting list iterator as an abstract data type (a Java interface sketch follows the list):

  1. getdid() returns the docid of the current position in the posting list, where the pointer address is 'currentptr'. The iterator is initialized with 'currentdid' set to the docid of the first posting in its posting list.
  2. getscore() computes the partial score s(t_i, d) of 'currentdid' in the posting list, using the data stored in the posting lists (e.g., frequencies or the number of postings) and other statistics stored outside the inverted index (e.g., document lengths).
  3. reset() sets the current address of the posting list to zero and re-initializes its current posting to the first posting in the list, i.e., $\mathrm{currentptr}\leftarrow 0$, $\mathrm{currentposting}\leftarrow t.\mathrm{iterator}[\mathrm{currentptr}]$.
  4. next() increments the current address to return the next posting of term t's posting list, i.e., $\mathrm{currentptr}\!+\!\!+\!$, $\mathrm{currentposting}\leftarrow t.\mathrm{iterator}[\mathrm{currentptr}]$. If the iterator reaches the end of the posting list, it calls the close function to release the memory resources.
  5. skipto(d) randomly accesses the position with docid equal to d, or the first position with docid greater than d, in the posting list. The details of the skipping procedure depend on the implementation of the skipping structure.
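The interface below restates these five methods in Java (an illustrative signature sketch; names follow the text above, not the Terrier API):

```java
// Abstract posting list iterator used by the traversal techniques in this paper.
interface PostingIterator {
    int getdid();        // docid at the current position ('currentdid')
    double getscore();   // partial score s(t_i, d) of the current posting
    void reset();        // currentptr <- 0: back to the first posting
    boolean next();      // advance one posting; returns false (and closes) at end of list
    void skipto(int d);  // jump to docid d, or to the first docid greater than d
}
```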

The skipping structure we use is similar to that proposed by Jonassen and Bratsberg (2011), in which the postings are separated into data chunks and skipping pointers are built on top of them to generate hierarchical skipping chunks. All the data chunks and skipping chunks have the same length L. The last docid and the end-offset of each chunk are recursively stored in a chunk at the level above. The skipto(d) operation is performed by comparing d with the last docid of the active chunk. When d is greater, the chunk at the level above becomes the active chunk; otherwise, the chunk referred to by the offset for that docid is decoded. The search descends iteratively until a data chunk is reached. Within each chunk, binary search locates the element indicating the lower level chunk or the exact target d.

For compressed posting lists, the worst-case number of decompressed chunks and fetched blocks in a random skipto(d) operation equals the number of skip levels, i.e., $L_t=\log_L N_t$, where $N_t$ is the length of term t's posting list and L is the common skipping chunk length (L=128 in this study). The skiplist structure is integrated into the posting lists. For a posting list with length $N_t$ and skipping chunk length L, the total number of extra skiplist entries is ${E_t} = \sum\nolimits_{i = 1}^{{L_t}} {\left\lceil {{N_t}/{L^i}} \right\rceil }$, which sums the number of skipping pointers expected at each skipping level. The search complexity of the skiplists is defined by the number of steps necessary to find the target posting. In the worst case, the number of steps at each level is at most L/2, with at most $L_t$ levels. Thus, the search complexity of the skiplists is $\Theta(L_t\cdot L/2)=\Theta(\log N_t)$.
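The mechanics of skipto(d) can be illustrated with a single skip level (the paper's structure is recursive, but one level shows the idea; this Java sketch is illustrative, with compression and end-of-list checks omitted):

```java
// One-level skipping: 'lastDocid[c]' is the last docid of data chunk c, so a binary
// search over it finds the only chunk that can contain d without decoding the others.
import java.util.Arrays;

final class SkippingList {
    static final int L = 128;   // chunk length used in this study
    final int[] lastDocid;      // skip table: last docid of each data chunk
    final int[][] chunks;       // data chunks of docids (already decoded here)
    int chunk = 0, pos = 0;     // current cursor position

    SkippingList(int[] lastDocid, int[][] chunks) {
        this.lastDocid = lastDocid;
        this.chunks = chunks;
    }

    // Move the cursor to docid d, or to the first docid greater than d.
    void skipto(int d) {
        // 1) find the first chunk whose last docid >= d (earlier chunks stay compressed)
        int c = Arrays.binarySearch(lastDocid, chunk, lastDocid.length, d);
        if (c < 0) c = -c - 1;
        if (c != chunk) { chunk = c; pos = 0; }  // a real decoder would decompress chunk c here
        // 2) binary search within the data chunk for d itself
        int p = Arrays.binarySearch(chunks[chunk], pos, chunks[chunk].length, d);
        pos = (p >= 0) ? p : -p - 1;
    }

    int currentDocid() { return chunks[chunk][pos]; }
}
```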

4 Dynamic pruning on LSF retrieval

In this section, we present our rank safe dynamic pruning techniques based on LSF retrieval, which check each candidate so that postings can be pruned.

4.1 Ignoring postings

The proposed pruning technique works by monitoring two basic quantities: a threshold value $\theta$ and a remainder function $\rho$. The threshold value $\theta$ is a lower bound on the score of the last document that will be returned to the user. Here, we assume that k represents the number of documents requested. We can compute a reasonable $\theta$ by using the kth largest score in the current result heap (or accumulator table); if k accumulators do not yet exist, $\theta=0$. The second monitored quantity is the score remainder function $\rho$. This function computes an upper bound on the total additional score that a posting could possibly achieve through further processing of the remaining terms' posting lists. The upper bound of a posting list can be estimated on the fly during query processing (Macdonald et al., 2011) or be exactly computed in the index construction phase, the approach we adopt here (Ding and Suel, 2011; Jonassen and Bratsberg, 2011).

We define the function $\rho$ as follows: $\rho(t_i)$ is the largest score that a posting could achieve in the posting list for term $t_i$, and $\rho(q)= \sum\nolimits_{t \in q} \rho(t)$, where q is the set of terms in the query. We also maintain the threshold score $\theta$ as the kth largest score in the result heap. We define $q_\textrm{r}$ as the set of the remaining, thus far unconsidered terms. Every time we finish one posting list of a given term, we check whether $\theta>\rho(q_\textrm{r})$, where $\rho(q_\textrm{r})$ is the largest achievable score of new postings in the remaining terms. When $\theta>\rho(q_\textrm{r})$, we know that no more posting lists of the remaining terms need to be considered, because no new posting can achieve a score greater than $\rho(q_\textrm{r})$. Thus, we terminate the processing and return the best k scoring documents in the heap. We call this method LSF index traversal with list omitting (LSF_LO). Performance gains are achieved when the termination condition is met, as no more posting lists of terms $t\in q_\textrm{r}$ need to be processed. The performance cost lies in the revisiting of the posting lists of already processed terms. The lists for the query terms are sorted in descending order of their upper bounds, so that the lists with larger upper bounds are considered first and the high scoring postings appear as early as possible.

For an in-depth posting ignoring heuristic, when $\theta>\rho(q)$ (which can occur only while the first posting list is being considered), we can conclude that the remaining postings in the current posting list must also exist in all other posting lists to qualify; a posting that does not meet this condition is ignored. Moreover, instead of checking the termination condition only at the end of every list traversal, we immediately check the achievable score of every posting. Thus, at every step when considering the next posting in the current list, instead of fully scoring each such candidate against all remaining posting lists, we perform partial scoring as follows. We first evaluate the score of the posting in the current list. If this score plus the upper bounds of the remaining lists is less than threshold $\theta$, then we can abort the evaluation of the posting, which avoids large amounts of skipping. Otherwise, we score the candidate in the remaining posting lists until either we have a complete score or the maximum possible score drops below the threshold $\theta$. We call this pruning method LSF index traversal with partial scoring (LSF_PS).

4.2 Algorithm description

In this subsection, we present an in-depth description of the implementation of the algorithm. The details of the pseudo code are depicted in Algorithm 2. The algorithm combines the partial scoring and list omitting pruning techniques in LSF. Prior to query processing, we order the iterators by descending upper bounds and calculate their cumulative upper bounds from last to first. We maintain a heap to store k candidates.


Algorithm 2   Dynamic pruning on LSF retrieval


Each iteration begins by finding and scoring the lowest docid in the posting list with the highest upper bound (line 2). We call the term of this posting list the candidate selection term and the other terms, with smaller upper bounds, score adding terms. We then check every other iterator of the score adding terms in descending order, and, if the current score plus the cumulative upper bounds of the remaining terms is less than the threshold, the algorithm stops further evaluation of this candidate and proceeds to the next one (lines 12-14). This is the core idea of partial scoring in LSF_PS. All terms that are used for scoring can be skipped efficiently (line 15). When a document is inserted into the full heap with k candidates, we set the threshold to the minimum score of the candidates in the heap (line 20). When the cumulative upper bound of the remaining terms is less than the threshold, the algorithm terminates, because no candidate document that has not yet appeared can be added to the result heap (lines 23-25). This is the core idea of list omitting in LSF_LO. Otherwise, all of the iterators of the scoring terms are reset for another iteration (lines 26-28).
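A Java sketch of this combined procedure is given below (illustrative only; it follows the description above, not the published pseudo code, and uses the PostingIterator interface from Section 3.3; 'ub[i]' stands for $\rho(t_i)$):

```java
// Dynamic pruning on LSF (LSF_LO + LSF_PS). Iterators are sorted by descending upper
// bound; cum[i] = ub[i] + ... + ub[n-1] is the cumulative upper bound from list i on.
import java.util.BitSet;
import java.util.PriorityQueue;

final class LsfPruned {
    static PriorityQueue<double[]> topK(PostingIterator[] its, double[] ub, int k) {
        int n = its.length;
        double[] cum = new double[n + 1];                  // cumulative upper bounds, last to first
        for (int i = n - 1; i >= 0; i--) cum[i] = cum[i + 1] + ub[i];
        PriorityQueue<double[]> heap = new PriorityQueue<double[]>((a, b) -> Double.compare(a[1], b[1]));
        double theta = 0.0;                                // threshold: k-th best score so far
        BitSet seen = new BitSet();
        for (int i = 0; i < n; i++) {                      // list i acts as candidate selection term
            do {
                int d = its[i].getdid();
                if (seen.get(d)) continue;                 // duplicate: advance to the next posting
                seen.set(d);
                double score = its[i].getscore();
                for (int j = i + 1; j < n; j++) {          // score adding terms, descending ub
                    if (score + cum[j] <= theta) { score = -1; break; }  // LSF_PS: abort candidate
                    its[j].skipto(d);
                    if (its[j].getdid() == d) score += its[j].getscore();
                }
                if (score > theta) {
                    heap.offer(new double[]{d, score});
                    if (heap.size() > k) { heap.poll(); theta = heap.peek()[1]; }  // raise threshold
                }
            } while (its[i].next());
            if (theta > cum[i + 1]) break;                 // LSF_LO: omit all remaining lists
            for (int j = i + 1; j < n; j++) its[j].reset();  // reset SATs for the next pass
        }
        return heap;                                       // the k highest scoring documents
    }
}
```

Note that a `continue` inside `do { ... } while` jumps to the loop condition, so duplicate candidates simply advance the candidate iterator.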

5 Experiments

In our experiments, we used the TREC GOV2 collection containing approximately 25.2 million documents and approximately 32.8 million terms in its vocabulary, with an uncompressed size of 426 GB. The index structure contains three parts, including the lexicon file, posting lists, and the statistical file. We built posting lists and the hierarchical skipping structure with 128 docids per chunk using the PForDelta compression algorithm (Zukowski et al., 2006), removing standard English stopwords, and applying Porter's English stemmer. We generated the maximum score of every term in the lexicon according to the contribution of its postings and stored them along with the lexicon file. The statistical file stored the total number of unique terms, documents, postings, etc. The final compressed index size was 6.72 GB. We also built another index without skipping, and found that the skiplist structure added about 82.1 MB to the index size, which is a 1.19% increase. We used the first 10 000 queries selected from the TREC 2005 Efficiency Track Queries using distinct numbers of terms with $|q|\geq 2$, and we separated the queries into those with length from 2 to 5 and those with length larger than 5.

Our experiments were performed on an Intel® Xeon® E5620 processor running at 2.40 GHz with 8 GB RAM and 12 288 KB cache. The default physical block size is 16 KB unless stated otherwise. All traversal solutions were implemented in Java on the Terrier retrieval platform, which is widely used in research on information retrieval (Ounis et al., 2006). All of the code is openly available at https://github.com/deeper2/LSF. In every experiment where we report running time, the benchmark was first executed a number of times to warm up the JVM, and the numbers were averaged over three independent runs after the JVM had reached steady state performance (Delbru et al., 2012). The default number of documents retrieved for each query was 10, i.e., one results page. Performance was always measured by average time per query in milliseconds. As we focus on rank safe query optimization techniques, effectiveness measurements are omitted in the following experiments.

In this study, we used Okapi BM25 as the ranking function (Büttcher et al., 2010). The standard Okapi BM25 bag-of-words similarity scoring function, used as the reference for S(q, d), scores documents by their predicted relevance to the query and is defined as

$S(q, d)= \sum\limits_{t \in q} s(t, d)= \sum\limits_{t \in q} \mathrm{idf}_t \cdot \mathrm{TF}_{\mathrm{BM25}}(t, d)= \sum\limits_{t \in q} \log(N/N_t)\cdot \mathrm{TF}_{\mathrm{BM25}}(t, d), $

where

$\mathrm{TF}_{\mathrm{BM25}}(t, d)= \frac{{{f_{t, d}} \cdot({k_1} + 1)}}{{{f_{t, d}} + {k_1} \cdot [(1 - b)+(b \cdot {l_d}/{l_{{\rm{avg}}}})]}}$

is the term frequency component. Note that $l_d$ is the number of symbols in document d and $l_{\mathrm{avg}}$ is the average of $l_d$ over the collection. The formula has two free parameters, set to the defaults $k_1$=1.2 and b=0.75.
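A direct Java transcription of s(t, d) under these formulas (illustrative; parameter names follow the text, and this is not the Terrier implementation):

```java
// Okapi BM25 per-term score s(t, d) = idf_t * TF_BM25(t, d).
final class Bm25 {
    static final double K1 = 1.2, B = 0.75;  // default free parameters

    // ftd: term frequency f_{t,d}; n: collection size N; nt: document frequency N_t;
    // ld: length of document d; lavg: average document length over the collection.
    static double score(int ftd, long n, long nt, double ld, double lavg) {
        double idf = Math.log((double) n / nt);                 // idf_t = log(N / N_t)
        double tf = ftd * (K1 + 1)
                  / (ftd + K1 * ((1 - B) + B * ld / lavg));     // TF_BM25(t, d)
        return idf * tf;
    }
}
```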

5.1 Exhaustive index traversal

As mentioned, one of the main challenges associated with the processing of disjunctive (OR) queries is that they tend to be much slower than conjunctive (AND) queries, as they have to evaluate more documents. The two types of queries can be applied to each of the index traversal strategies mentioned above. We implemented six types of exhaustive index traversal algorithms, namely disjunctive DAAT, disjunctive TAAT, disjunctive LSF, conjunctive DAAT, conjunctive TAAT, and conjunctive LSF, based on the code of the Terrier retrieval platform. A preliminary comparison of the exhaustive index traversal techniques, including LSF, DAAT, and TAAT, is depicted in Table 1.

Table 1   Average query latency of the exhaustive index traversal techniques with different numbers of results

Algorithm    k=10      k=100     k=1000
OR_TAAT      1290.1    1250.8    1275.1
OR_DAAT      231.6     232.6     237.9
OR_LSF       279.6     281.2     284.8
AND_TAAT     238.7     238.3     238.4
AND_DAAT     24.1      24.1      24.3
AND_LSF      34.2      34.3      34.5



From Table 1, we can see that disjunctive queries tend to be significantly more expensive (by approximately an order of magnitude) than conjunctive queries under all of the index traversal techniques. This finding suggests the necessity of reducing the huge performance gap between disjunctive and conjunctive queries. Thus, we focus on disjunctive queries for LSF in this study, i.e., queries that do not disqualify a document merely because it misses some of the query terms. In addition, we find that DAAT runs much faster than TAAT for both disjunctive and conjunctive queries. LSF traversal performs a little slower than DAAT but much better than TAAT for both disjunctive and conjunctive queries. This is mainly due to the pressure on main memory from storing information for all documents during TAAT processing as data sets grow, and to the CPU cache misses caused by random accesses to the accumulators used for storing intermediate results. Overall, the LSF exhaustive index traversal technique gives performance comparable to that of the DAAT exhaustive index traversal technique, and we intend to improve its performance further via dynamic pruning techniques.

We also measure performance by other criteria, namely the numbers of processed elements: the total number of candidates inserted into the result heap, calls to the scoring function, docids evaluated, and chunks decompressed. These criteria provide an in-depth view behind the above query latency measurements and have also been used in previous works (Lacour et al., 2008; Jonassen and Bratsberg, 2011). Table 2 shows these criteria for the six index traversal techniques with the number of results k=10.

Table 2   The average number of processed elements for different index traversal techniques with the number of results k = 10

Algorithm    Nhi      Nsc (×10³)   Nde (×10³)   Ncd
OR_TAAT      10.0     2591.6       2591.6       284.8
OR_DAAT      119.5    2591.6       2591.6       284.8
OR_LSF       83.4     2591.6       3256.1       650.0
AND_TAAT     10.0     207.6        2591.6       284.8
AND_DAAT     50.5     76.8         227.7        250.0
AND_LSF      50.5     76.8         227.7        250.0

Nhi: number of heap inserts; Nsc: number of scorings called

Nde: number of docids evaluated; Ncd: number of chunks decoded



As observed in Table 2, the disjunctive queries processed considerably more elements than did the conjunctive queries in all cases with the different exhaustive traversal techniques. This is because skipping can be used in conjunctive queries to avoid processing a large number of postings, by allowing the shortest posting list to skip through the other posting lists to the pointer address with the same docid. With disjunctive queries, the numbers of scoring function calls, docids evaluated, and chunks decoded are the same for DAAT and TAAT; all of the postings and chunks are considered, just at different stages of the two schemes. However, the numbers of docids evaluated and chunks decoded for LSF increase significantly, due to the revisiting of chunks and postings. The number of scoring calls is the same as for the disjunctive baselines, thanks to the duplicate checking before posting scoring. With conjunctive queries, the three criteria for DAAT and LSF are significantly reduced compared with TAAT. This is because the largest docid of the current positions can be selected as the candidate document in DAAT, and the processing of many postings that appear in only some of the lists can be avoided. For LSF, the candidate document is selected from the posting list of the most significant term and must first be checked to appear in all the other posting lists. Thus, the processed elements of LSF are the same as those of DAAT.

Here, we look mainly at the number of heap inserts of the three exhaustive traversal techniques. It is worth noting that TAAT can simply do a partial sort to extract the top-k results and then fully rank them, which avoids a large number of heap inserts. With conjunctive queries, the criteria are the same for LSF and DAAT; this is because the posting lists of the two are read in the same sequential order, which yields the same candidates and the same threshold changes. With disjunctive queries, the number of heap inserts of LSF is almost 30% less than that of DAAT. In essence, this is mainly because LSF sorts the terms in descending order of importance (ascending document frequency) and processes postings of important terms first. Thus, the threshold rises quickly, which greatly reduces the number of candidates inserted into the result heap. This suggests that the advantage of LSF retrieval can be exploited for further query processing improvements by dynamic pruning techniques.

5.2 LSF dynamic pruning

As well-known dynamic pruning techniques, MaxScore and WAND differ in list sorting and partial scoring, but both achieve top performance in query processing on large-scale document-sorted indexes. To evaluate the performance of our proposed LSF-based dynamic pruning techniques on document-sorted indexes, we present a detailed comparison of WAND, MaxScore, LSF_LO, and LSF_PS. We re-implemented the WAND approach in Java, referring to the code provided by Dimopoulos et al. (2013), and the MaxScore approach based on the code available at https://github.com/s-j/laika. In the following, we compare mainly two non-partial scoring methods and two partial scoring methods. We measure performance according to the following criteria: average time per query and the numbers of chunks decoded, docids evaluated, calls to the scoring function, and candidates inserted into the result heap.

Table 3 shows the average query processing time of the four dynamic pruning techniques with different query lengths. MaxScore and WAND both achieve significant performance gains compared with the OR_DAAT technique, and MaxScore performs better, owing to the costly list sorting phase of WAND and its lack of consideration of term importance. LSF_LO achieves significant performance gains compared with the exhaustive traversal, which shows the effect of considering term importance and of list omitting. We find that LSF_PS achieves better performance than MaxScore for queries with more than two terms. The reason is that the query terms of the LSF-based techniques are sorted in descending order of upper bound score, and the candidate documents are selected from the more important terms.

Table 3   Average query latency of different dynamic pruning techniques for different numbers of query terms

Algorithm Average query latency (ms)
Average 2 3 4 5 >5
WAND 47.3 30.6 44.6 58.4 68.3 103.7
MaxScore 35.6 26.7 33.3 38.8 47.7 63.1
LSF_LO 61.1 38.9 56.9 70.2 89.9 140.3
LSF_PS 34.4 28.3 33.2 37.8 45.9 58.6



Second, we note that the performance gap between LSF_PS and LSF_LO increases with the number of query terms, as does the gap between LSF_PS and WAND. As the number of terms increases, there is significant improvement over MaxScore, due to the LSF scheme's consideration of term importance. This phenomenon demonstrates the importance of partial scoring in LSF inverted index traversal: with more query terms, more scoring work is wasted on candidates that cannot enter the top-k results, which slows down candidate document checking for techniques without partial scoring. Overall, our LSF-based dynamic pruning techniques achieve query processing performance comparable to the best DAAT-based dynamic pruning technique.

Table 4 shows other criteria for the four pruning techniques, providing an in-depth view of their performance. The numbers of candidates inserted into the result heap are almost the same for the two DAAT-based techniques, and likewise for the two LSF-based techniques. The costly revisiting of posting lists makes LSF_LO process a few more elements than the other three techniques, in terms of scoring function calls, docids evaluated, and blocks decoded, although most of them can be ignored by list omitting. However, the number of candidate documents inserted into the result heap is sharply reduced for the LSF-based techniques compared with the DAAT-based techniques. This is mainly due to the important term selection in LSF. The threshold of the result heap of the LSF-based techniques grows faster than that of the DAAT-based techniques, which avoids a large number of unnecessary inserts into the result heap. This phenomenon demonstrates the superiority of LSF over DAAT for dynamic pruning techniques.

Table 4   The average number of processed elements of different dynamic pruning techniques with the number of results k = 10

Algorithm    Nhi      Nsc (×10³)   Nde (×10³)   Ncd
WAND 119.4 114.6 379.5 280.0
MaxScore 119.4 215.4 238.9 223.3
LSF_LO 83.4 258.6 393.9 324.0
LSF_PS 83.4 188.5 219.0 240.6

Nhi: number of heap inserts; Nsc: number of scorings called;

Nde: number of docids evaluated; Ncd: number of chunks decoded



Furthermore, LSF_PS, with its partial scoring, performs significantly better than LSF_LO on the three criteria. The numbers of heap inserts, docids evaluated, and scoring function calls for LSF_PS are smaller than those of the best performing MaxScore, and the numbers of blocks decoded are almost the same. Although both LSF_PS and MaxScore use partial scoring techniques, the candidates in LSF_PS are selected from more important terms, which makes the termination condition occur earlier. From these results, we can conclude that the improvement of the techniques lies mainly in the highly efficient LSF index traversal and the partial scoring operations.

5.3 Extensions

Real search engines use ranking functions based on hundreds of features. However, such functions are quite expensive to evaluate. To achieve high efficiency, search engines commonly separate the ranking process into two or more phases (Wang et al., 2011). In the first phase, a very simple and fast ranking function is used to score the full postings that match the query and then return the top-k scored documents. In the following phases, more increasingly complicated ranking functions with more and more features are applied to documents that pass through the earlier phases. Thus, the later phases examine only a fairly small number of result candidates, and a significant amount of computation is still spent in the first phase. A simple and fast ranking function is adopted to obtain the top 100 or 1000 documents in the first phase. Then a complicated ranking function with hundreds of features is used to explicitly evaluate the results.

Table 5 shows the experimental results of the four dynamic pruning techniques as we increase the number of results for first phase ranking. We find that the query latency of the four techniques increases with k, and that LSF_PS achieves slightly better and stable performance compared with MaxScore for different k. One reason is that we focus on techniques that avoid temporary data structures whose size depends on k. Thus, the size of the result set influences only the number of heap inserts and does not affect other costs, e.g., memory cost. As the number of results increases, the gap between LSF_LO and MaxScore narrows. This is because the threshold of the result heap keeps a lower value for a larger number of results, so the effect of partial scoring is reduced, while the important term selection and list omitting still make the LSF-based dynamic pruning techniques stop early. In fact, we also note that the superiority of LSF_PS over MaxScore increases for larger numbers of results.

Table 5   Average query latency of different dynamic pruning techniques with different numbers of results

Algorithm Average query latency (ms)
k=10 k=50 k=100 k=500 k=1000
WAND 47.3 54.4 62.4 76.6 86.3
MaxScore 35.6 45.5 51.5 66.6 79.0
LSF_LO 61.1 67.3 70.4 86.9 94.4
LSF_PS 34.4 41.5 45.1 63.7 71.1



Fig. 4 illustrates the average latency of the pruning techniques as we increase the physical size of the fetched blocks (varying from 1 KB to 128 KB). As the results show, the query efficiency improves gradually as the block size becomes larger. The decrease in the measured latency between 1 KB and 128 KB blocks is 32.6% for LSF_LO and 25.0% for LSF_PS. The query efficiency becomes roughly stable when the size of block is larger than 16 KB. This can be explained by a decrease in random disk-seeks for the larger physical block size and the fact that when compressed data is split between two smaller physical blocks, it requires two block fetches to fetch the data chunk.


Fig. 4   Average query latency of different dynamic pruning techniques with various physical block sizes (k = 10)


6 Conclusions

In this paper, we have proposed a novel exhaustive index traversal technique called largest scores first (LSF). It can reduce the memory consumption and candidate selection cost of existing methods at the expense of revisiting the posting lists of the remaining query terms. Preliminary analysis and implementation showed comparable performance between LSF and existing methods. To further reduce the number of postings that need to be revisited, we have also presented efficient dynamic pruning techniques based on LSF, including list omitting (LSF_LO) and partial scoring (LSF_PS). The two dynamic pruning techniques take advantage of the term importance consideration of LSF, which gives priority to more promising candidates. Experimental results with the TREC GOV2 collection showed that our LSF retrieval is better than TAAT retrieval and comparable to DAAT retrieval, and has potential advantages when combined with dynamic pruning techniques. Further experiments showed that our LSF-based pruning techniques reduce disjunctive query processing time by almost 27% over the WAND baseline and perform slightly better than the MaxScore baseline, while returning the same results as the exhaustive evaluation. With more query terms or more returned results, our LSF-based pruning schemes (list omitting and partial scoring) showed larger improvements over the two baselines.

Further investigations will extend to the LSF-based dynamic pruning techniques on the block-max indexes (Chakrabarti et al., 2011; Ding and Suel, 2011), an augmented structure that stores the piece-wise maximum score of each block in a posting list, to obtain a more exact estimate of the upper bound score of candidates. We believe that the adoption of block-max indexes is orthogonal to our work and can further improve the query processing performance of the LSF-based dynamic pruning techniques.

Reference

Anh VN, Moffat A, 2005. Simplified similarity scoring using term ranks. Proc. 28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.226-233. doi:10.1145/1076034.1076075
Anh VN, Moffat A, 2006. Pruned query evaluation using pre-computed impacts. Proc. 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.372-379. doi:10.1145/1148170.1148235
Anh VN, Moffat A, et al., 2010. Index compression using 64-bit words. Softw. Pract. Exper., 40(2):131-147. doi:10.1002/spe.948
Badue C, Ribeiro-Neto B, Baeza-Yates R, et al., 2001. Distributed query processing using partitioned inverted files. Proc. 8th Int. Symp. on String Processing and Information Retrieval, p.10-20. doi:10.1109/SPIRE.2001.989733
Broder AZ, Carmel D, Herscovici M, et al., 2003. Efficient query evaluation using a two-level retrieval process. Proc. 12th Int. Conf. on Information and Knowledge Management, p.426-434. doi:10.1145/956863.956944
Buckley C, Lewit AF, 1985. Optimization of inverted vector searches. Proc. 8th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.97-110. doi:10.1145/253495.253515
Büttcher S, Clarke CLA, 2007. Index compression is good, especially for random access. Proc. 16th ACM Conf. on Information and Knowledge Management, p.761-770. doi:10.1145/1321440.1321546
Büttcher S, Clarke CLA, Cormack GV, 2010. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press, USA.
Chakrabarti K, Chaudhuri S, Ganti V, 2011. Interval-based pruning for top-k processing over compressed lists. Proc. 27th Int. Conf. on Data Engineering, p.709-720. doi:10.1109/ICDE.2011.5767855
Croft B, Metzler D, Strohman T, 2010. Search Engines: Information Retrieval in Practice. Addison Wesley, USA.
Dean J, 2009. Challenges in building large-scale information retrieval systems: invited talk. Proc. 2nd ACM Int. Conf. on Web Search and Data Mining, p.1. doi:10.1145/1498759.1498761
Delbru R, Campinas S, Tummarello G, 2012. Searching web data: an entity retrieval and high-performance indexing model. Web Semant. Sci. Serv. Agents World Wide Web, 10:33-58. doi:10.1016/j.websem.2011.04.004
Dimopoulos C, Nepomnyachiy S, Suel T, 2013. Optimizing top-k document retrieval strategies for block-max indexes. Proc. 6th ACM Int. Conf. on Web Search and Data Mining, p.113-122. doi:10.1145/2433396.2433412
Ding S, Suel T, 2011. Faster top-k document retrieval using block-max indexes. Proc. 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.993-1002. doi:10.1145/2009916.2010048
Fontoura M, Josifovski V, Liu JH, et al., 2011. Evaluation strategies for top-k queries over memory-resident inverted indexes. Proc. VLDB Endow., p.1213-1224.
Jiang K, Yang YX, 2015. Exhaustive hybrid posting lists traversing technique. Proc. 5th Int. Conf. on Intelligence Science and Big Data Engineering, p.1-11. doi:10.1007/978-3-319-23862-3_1
Jiang K, Song XS, Yang YX, 2014. Performance evaluation of inverted index traversal techniques. Proc. 17th Int. Conf. on Computational Science and Engineering, p.1715-1720. doi:10.1109/CSE.2014.315
Jonassen S, Bratsberg SE, 2011. Efficient compressed inverted index skipping for disjunctive text-queries. Proc. 33rd European Conf. on Advances in Information Retrieval, p.530-542. doi:10.1007/978-3-642-20161-5_53
Lacour P, Macdonald C, Ounis I, 2008. Efficiency comparison of document matching techniques. Proc. European Conf. on Information Retrieval, p.37-46.
Lester N, Moffat A, Webber W, et al., 2005. Space-limited ranked query evaluation using adaptive pruning. Proc. 6th Int. Conf. on Web Information Systems Engineering, p.470-477. doi:10.1007/11581062_37
Macdonald C, Ounis I, Tonellotto N, 2011. Upper-bound approximations for dynamic pruning. ACM Trans. Inform. Syst., 29(4):17.1-17.28. doi:10.1145/2037661.2037662
Manning CD, Raghavan P, Schütze H, 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, USA.
Melink S, Raghavan S, Yang B, et al., 2001. Building a distributed full-text index for the Web. Proc. 10th Int. Conf. on World Wide Web, p.396-406. doi:10.1145/371920.372095
Moffat A, Zobel J, 1996. Self-indexing inverted files for fast text retrieval. ACM Trans. Inform. Syst., 14(4):349-379. doi:10.1145/237496.237497
Ounis I, Amati G, Plachouras V, et al., 2006. Terrier: a high performance and scalable information retrieval platform. Proc. OSIR Workshop, p.18-25.
Puppin D, Silvestri F, Perego R, et al., 2010. Tuning the capacity of search engines: load-driven routing and incremental caching to reduce and balance the load. ACM Trans. Inform. Syst., 28(2):5.1-5.36. doi:10.1145/1740592.1740593
Silvestri F, Venturini R, 2010. VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming. Proc. 19th ACM Int. Conf. on Information and Knowledge Management, p.1219-1228. doi:10.1145/1871437.1871592
Strohman T, Croft WB, 2007. Efficient document retrieval in main memory. Proc. 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.175-182. doi:10.1145/1277741.1277774
Strohman T, Turtle H, Croft WB, 2005. Optimization strategies for complex queries. Proc. 28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.219-225. doi:10.1145/1076034.1076074
Turtle H, Flood J, 1995. Query evaluation: strategies and optimizations. Inform. Process. Manag., 31(6):831-850. doi:10.1016/0306-4573(95)00020-H
Wang LD, Lin J, Metzler D, 2011. A cascade ranking model for efficient ranked retrieval. Proc. 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.105-114. doi:10.1145/2009916.2009934
Zobel J, Moffat A, 2006. Inverted files for text search engines. ACM Comput. Surv., 38(2):6.1-6.56. doi:10.1145/1132956.1132959
Zukowski M, Heman S, Nes N, et al., 2006. Super-scalar RAM-CPU cache compression. Proc. 22nd Int. Conf. on Data Engineering, p.59. doi:10.1109/ICDE.2006.150
