• 30Jan

    Hi everyone.

    I just want to let you know that I’ve taken some further steps to describe and publish my master’s thesis. I have written a page (http://asbjorn.fellinghaug.com/blog/master-thesis/) whose goal is to summarize and further describe the overall goals and design of the thesis.

    In time, I will also continue working on the bigram index, as I want to see its full potential on a more realistic collection. To begin with, I will use the dumps provided by the wonderful Wikimedia Foundation. These dumps are several gigabytes of plain text (plus some metadata). I realize that the content of Wikipedia articles may not fully reflect typical websites on the internet, but it is a start. The next step I have set for myself is to find a sufficiently large website, index all the data on it, and then check how the bigram index performs on it.

    I will most likely continue development in Java, as it is the language Apache Lucene is written in. However, I’m also quite interested in writing a Python analyzer for the PyLucene package (a Python wrapper around Java Lucene).

  • 06Jan

    Hi everyone.

    Every now and then I get somewhat annoyed when I add temporary or “unwanted” files to a Subversion repository. These may be plain temporary files such as *.tmp, which have no value in a repository, or compiled Python files (*.pyc), and so on.

    What these files have in common is that they are generally unwanted and can easily be removed from the SVN repository. A common approach to this issue is to set a property named “svn:ignore” on a directory, or on the whole directory structure. This can be achieved with this command:

    $ svn propset svn:ignore "*.tmp" .

    where the single dot at the end denotes the current directory. However, this must be done for each Subversion project, which gets annoying over time. I’ve recently discovered that global Subversion settings can be set for my own user. The per-user settings file is located at $HOME/.subversion/config. In that file there is a section named “[miscellany]”, which holds a variable named “global-ignores”. This variable can hold multiple whitespace-separated ignore patterns, which apply to all the SVN checkouts you work on (provided you are using this user).
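
    For the per-directory approach, note that the svn:ignore property expects one pattern per line, so several patterns have to be joined with newlines. A minimal sketch of a helper that does this; the function name ignore_patterns is made up for this illustration and is not part of Subversion:

    ```shell
    # Hypothetical helper (not an svn feature): set several ignore
    # patterns on a single directory in one go.
    # svn:ignore takes one pattern per line, so the arguments are
    # joined with newlines before being handed to "svn propset".
    ignore_patterns() {
        dir="$1"; shift
        svn propset svn:ignore "$(printf '%s\n' "$@")" "$dir"
    }

    # Usage (inside a working copy):
    #   ignore_patterns . '*.tmp' '*.pyc'
    ```

    Keep in mind that propset replaces whatever value the property already had on that directory, so check it first with "svn propget svn:ignore ." if in doubt.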

    This settings file also contains many more interesting options, such as automatic properties which apply to certain files. Have a look at the end of the $HOME/.subversion/config file and notice some of the predefined settings.

    A tip based on personal experience is to add the following patterns to the global-ignores variable:

    *.pyc # python byte compiled code
    *.swp # vim swap file
    *.tmp # general temp files
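
    In the config file itself, these patterns go on a single line, separated by whitespace; the descriptions after the # signs above are only explanations. A minimal sketch of what that looks like, written to a scratch directory so that the real settings are left alone (CONFIG_DIR is an assumed stand-in for $HOME/.subversion):

    ```shell
    # Assumption: CONFIG_DIR is a throwaway directory used in place of
    # $HOME/.subversion, so nothing in the real settings is changed.
    CONFIG_DIR="$(mktemp -d)"

    # Write a minimal config holding only the [miscellany] section.
    printf '%s\n' '[miscellany]' 'global-ignores = *.pyc *.swp *.tmp' > "$CONFIG_DIR/config"

    # Read the value back to check what has been written.
    IGNORES="$(sed -n 's/^global-ignores *= *//p' "$CONFIG_DIR/config")"
    echo "$IGNORES"

    # To try it on a working copy without touching the real settings:
    #   svn status --config-dir "$CONFIG_DIR"
    ```

    Note that setting global-ignores replaces Subversion’s built-in default patterns, so you may want to keep the ones already listed in the commented-out example line of your own config file.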