Hi everyone.
I just want to let you know that I’ve taken some further steps to describe and publish my master thesis. I have written a page (http://asbjorn.fellinghaug.com/blog/master-thesis/) whose goal is to summarize and further describe the overall goals and design of the thesis.
I will also, in time, continue working on the bigram index, as I want to see its full potential on a more realistic collection. To begin with, I will use the dumps provided by the wonderful Wikimedia Foundation. These dumps contain several gigabytes of plain text (and some metadata). I realize that the content of Wikipedia articles may not fully reflect typical websites on the internet, but it is a start. The next step I have set for myself is to find a sufficiently large website, index all the data on it, and then check how the bigram index performs on it.
I will most likely continue development in the Java programming language, as it is the language Apache Lucene is written in. However, I’m also quite interested in writing a Python analyzer for the PyLucene package (the Python bindings for Lucene).
