• 30Jan

    Hi everyone.

    I just want to let you know that I’ve taken some further steps to describe and publish my master’s thesis. I have written a page (http://asbjorn.fellinghaug.com/blog/master-thesis/) whose goal is to summarize and further describe the overall goals and design of the thesis.

    In time, I will also continue working on the bigram index, as I want to see its full potential on a more realistic collection. To begin with, I will use the dumps provided by the wonderful Wikimedia Foundation. These dumps are several gigabytes of pure text (and some metadata). I realize that the content of Wikipedia articles may not fully reflect typical websites on the internet, but it is a start. The next step I have set for myself is to find a sufficiently large website, index all the data on it, and then check how the bigram index performs on it.
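
    For readers who have not seen the thesis, here is a rough idea of what a bigram index can look like. This is only a minimal sketch, assuming word-level bigrams and an in-memory map. The class BigramIndexSketch and its methods are hypothetical names made up for this illustration; they are not the thesis code and not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: not the thesis code and not a Lucene class.
class BigramIndexSketch {

    // Maps each word bigram (e.g. "quick brown") to the ids of documents containing it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Splits text on whitespace, lowercases it, and returns all adjacent word pairs.
    static List<String> bigrams(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    // Indexes one document by recording its id under every bigram it contains.
    void add(int docId, String text) {
        for (String bigram : bigrams(text)) {
            postings.computeIfAbsent(bigram, k -> new TreeSet<>()).add(docId);
        }
    }

    // Looks up a two-word phrase with a single map access, without comparing term positions.
    Set<Integer> lookup(String twoWordPhrase) {
        List<String> keys = bigrams(twoWordPhrase);
        if (keys.size() != 1) {
            throw new IllegalArgumentException("expected exactly two words");
        }
        return postings.getOrDefault(keys.get(0), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        BigramIndexSketch index = new BigramIndexSketch();
        index.add(1, "the quick brown fox");
        index.add(2, "a quick brown dog");
        System.out.println(index.lookup("quick brown")); // prints [1, 2]
    }
}
```

    The point of the sketch is the trade-off: a two-word phrase becomes a single lookup, at the cost of a larger index, which is why testing on a big collection such as the Wikipedia dumps matters.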

    I will most likely continue development in the Java programming language, as it is the language Apache Lucene is written in. However, I’m also quite interested in writing a Python analyzer for the PyLucene package (a Python binding for Java Lucene).

  • 30Aug
    Categories: java, lucene, school Comments: 0

    Hi everyone.

    I have now rewritten parts of the source code of my master’s thesis and added some Javadoc to make it more comprehensible. So I will now publish the whole code, although a lot later than initially planned. I’m not totally satisfied with the final code, however, since it may give the impression of being “run-and-play” code, which it is not. I would also recommend reading my master’s thesis, as many of the concepts in the source code are defined in much more detail there.

    I would also like to emphasize that the important part of the source code is DocumentAnalyzer.java#PhraseFilter3, which is responsible for manipulating the Lucene index to promote phrase searching capabilities, as discussed in my master’s thesis.

    The code is available as both a tar.gz and a zip archive: