My master thesis regards the task to enhance phrase searching in large text databases. By enhancing I here mean to gain even better phrase searching capabilities, as the straightforward approach is very demanding in terms of computing power. As a result of the very high demand for computing power, the time it takes to handle a phrase search request increases, which is one of the worst things in a search engine. The common approach is to use simple mechanisms and manipulate the phrase query into a bit more “friendly” query. Note that my master thesis mainly aims at enhancing the phrase query throughput for search engines, and the indexing process is not thoroughly considered.
Often, phrase queries involves much more computation than a regular boolean query. A boolean query is typically a query such as: norway python group. Such a query would, consisting of single terms, would imply a boolean query. Search engines nowadays has a standard boolean operator which would be used if no other operator is defined in the query. If standarized on the AND operator, the query above would be equal to this: norway AND python AND group. These kinds of queries is the most common ones, and provides a generally good retrieval performance. It is computionally fast to perform, and the recall would quite possibly be large. However, due to the fact that the query may represent only a small subset of the documents you may seek, it would not provide good enough precision. A document containing those query terms may be found, but the content would not reflect your information need. Thus there is a need to better defined your information need.
Instead of using boolean queries, an increasing popular approach is to use phrase queries. A phrase query is basically a query containing a subset of a sentence. For instance we may search a web page in which we know there is a given sentence (title, sub-title, author, etc) in. Thus, we can define our phrase query as this sentence, whereas each word is in a given order and is also evaluated in that particular order.
A famous example is the Shakespear quote: “to be or not to be”. Each term is in a given order, and is thus evaluated based on that. A boolean query of the same terms would give a significant larger retrieval list than this phrase query. This is due to the characteristics of the terms (very common terms – stopwords).
Phrase queries are computationally expensive in that they require more metadata from the index, and complex computation to connect the terms (the order of the query terms). The main reason for its increase in size is that each unique term in the index vocabulary has quite large posting lists. Especially “common terms”, such as “is, a, the, to, be, or, not, ..”. Since a phrase index must contain all terms, even the common ones, the index is much larger than a boolean index (where stopwords is often removed). And, in modern computers there is usually one significant bottleneck – the I/O systems. Reading large chunks of data from the hard drive is time consuming, and thus the response time increases. A phrase index is very large (given a large document collection), therefore requiring more from the underlying systems, such as I/O bandwidth. In my master thesis, I have taken these characteristics into account, and defined a new kind of index which provides faster phrase search capabilities (in response time) without the loss of precision. I have named this new index as the Bigram index.
The Bigram index utilize the characteristics of n-grams (at word level, not character) and stopwords, and from there on constructs a new kind of index which provides significant efficiency gains for phrase queries. I recommend having a look at my master thesis (PDF) here!
[Will add more information here / under construction]

“Phrase searching in text indexes” by Asbjørn Alexander Fellinghaug is licensed under a Creative Commons Attribution 3.0 Norway License.
Keywords: lucene, phrase query, bigram, ngram, search engine, inverted index, index, query
