A guide to searching fields, incremental indexing, parameter control and more…
Several text retrieval toolkits are available for building a search engine or simply testing a new search algorithm. I have used Lucene, Lemur and Terrier. Lucene is a snap when it comes to building an application layer over the search functionality. It is fast, easy to understand and well suited to building a commercial app. The downside is that it has not adopted some of the state-of-the-art models and still relies on basic Boolean retrieval and a basic vector space model. I recently noticed a BM25 implementation for Lucene floating around, so you may want to check that out. As for Lemur, I find it hard to modify and manage, so my experience with it has been limited.
My experience with Terrier has been positive; it is a pretty well thought out toolkit. Even as a newbie to Terrier I was able to figure out the indexing and querying mechanism pretty quickly. Learning the internal workings took me longer, because some of the essential information about using the toolkit is scattered. This post is a compilation of the essential information I have learnt over the course of my exploration.
The Terrier documentation does not say much about searching specific fields, but it is pretty simple. Assuming you have indexed your documents using the TRECCollection format, the following query forms work:
- "fieldname:query" ==> searches only the named field.
- "query1 fieldname:query2" ==> if query1 matches a document, a field-level check is then done with query2.
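To make the two query forms concrete, here is a minimal Java sketch that splits a query string into plain terms and field-restricted terms. It is not Terrier's query parser. The class name FieldQuerySketch, the field name title and the output labels are made up for illustration, and the sketch only shows how the "fieldname:query" syntax reads.

```java
import java.util.ArrayList;
import java.util.List;

class FieldQuerySketch {
    // Labels each whitespace-separated token as a plain term or a field-restricted term.
    static List<String> describe(String query) {
        List<String> parts = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields no parts
            }
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                // "fieldname:term" restricts the term to one field
                parts.add("field[" + token.substring(0, colon) + "]=" + token.substring(colon + 1));
            } else {
                // a bare term is matched against the whole document
                parts.add("any=" + token);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(describe("title:retrieval"));
        // prints [field[title]=retrieval]
        System.out.println(describe("terrier title:retrieval"));
        // prints [any=terrier, field[title]=retrieval]
    }
}
```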
Boosting Weights if a Term Occurs in a Specific Field
In the terrier.properties file insert the following:
- FieldTags.process=tag1,tag2,tag3 (the tags that should receive score boosting)
- field.modifiers=boost1,boost2,boost3 (the boost values, in the same order as tag1,tag2,tag3)
- matching.tsms=FieldScoreModifier (this must be set for the field-specific score boosting to work)
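As a rough sanity check on the settings above, the Java sketch below parses a properties snippet with java.util.Properties and confirms that the tag list and the boost list have the same number of entries. It assumes terrier.properties is read with java.util.Properties semantics, where a # only starts a comment at the beginning of a line, so descriptions belong on their own lines. The class name FieldBoostConfigCheck, the tags TITLE and H1 and the boost values are made up for illustration; none of this is part of Terrier.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class FieldBoostConfigCheck {
    // Parses properties text the way java.util.Properties reads a file.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits a comma-separated property value into trimmed, non-empty entries.
    static List<String> entries(Properties props, String key) {
        List<String> result = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(",")) {
            if (!part.trim().isEmpty()) {
                result.add(part.trim());
            }
        }
        return result;
    }

    // True when at least one tag is listed and every tag has a boost value.
    static boolean tagsMatchBoosts(Properties props) {
        int tags = entries(props, "FieldTags.process").size();
        int boosts = entries(props, "field.modifiers").size();
        return tags > 0 && tags == boosts;
    }

    public static void main(String[] args) {
        String config = String.join("\n",
            "# tags that should have score boosting (example names)",
            "FieldTags.process=TITLE,H1",
            "# one boost per tag, in the same order (placeholder values)",
            "field.modifiers=2.0,1.5",
            "matching.tsms=FieldScoreModifier");
        System.out.println(tagsMatchBoosts(parse(config)));
        // prints true

        // A trailing # is kept as part of the value, not treated as a comment.
        Properties inline = parse("FieldTags.process=TITLE,H1 #not a comment");
        System.out.println(inline.getProperty("FieldTags.process"));
        // prints TITLE,H1 #not a comment
    }
}
```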
Incremental Indexing with Terrier:
Terrier currently does not support incremental indexing, but there is a workaround. Essentially, you index the new files separately and then merge the new index structures with the old ones. [ see solution ]
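The idea behind the merge can be sketched with two tiny in-memory inverted indexes. This is a conceptual illustration only, not Terrier's merging code: the class IndexMergeSketch, the map-based index layout and the sample terms are all assumptions. The one detail worth noticing is that document ids from the new index are shifted by the number of documents in the old index, so the two sets of ids do not collide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IndexMergeSketch {
    // Merges two term -> docid postings maps; new docids are offset by oldDocCount.
    static Map<String, List<Integer>> merge(Map<String, List<Integer>> oldIndex,
                                            int oldDocCount,
                                            Map<String, List<Integer>> newIndex) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        // Copy the old postings unchanged.
        for (Map.Entry<String, List<Integer>> e : oldIndex.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        // Append the new postings with shifted document ids.
        for (Map.Entry<String, List<Integer>> e : newIndex.entrySet()) {
            List<Integer> postings = merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            for (int docId : e.getValue()) {
                postings.add(docId + oldDocCount);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Old index over 2 documents (ids 0 and 1).
        Map<String, List<Integer>> oldIndex = Map.of(
            "terrier", List.of(0, 1),
            "lucene", List.of(1));
        // New index over 1 freshly added document (local id 0).
        Map<String, List<Integer>> newIndex = Map.of(
            "terrier", List.of(0),
            "lemur", List.of(0));
        System.out.println(merge(oldIndex, 2, newIndex));
        // prints {lemur=[2], lucene=[1], terrier=[0, 1, 2]}
    }
}
```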
Controlling various parameters in Terrier:
I have discovered that there are several ways to control the different variables in Terrier. The first method is to set these variables in your code itself. For example, you can set the path of the index files this way: ApplicationSetup.TERRIER_INDEX_PATH = "some path". ApplicationSetup is the class that holds most of the parameter settings for indexing and retrieval. Its values are initially set from what you provide in the terrier.properties file. Since most fields in ApplicationSetup are static, it is easy to change them according to your needs, overriding what was read from terrier.properties.
Another way of configuring parameters is, of course, through the properties file itself. The terrier.properties file should hold all the parameters that you would like to control. There are lots of useful properties that you may not be aware of; see the list of configurable properties in Terrier, under the useful links, for everything you can set.
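The order of precedence described here (values read from a properties file first, then overridden in code through static fields) can be sketched without Terrier on the classpath. SettingsSketch below is a hypothetical stand-in for ApplicationSetup, not the real class, and the key terrier.index.path and the paths are used purely as illustrative values.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class SettingsSketch {
    // Static field, playing the role that ApplicationSetup.TERRIER_INDEX_PATH plays in the text.
    static String INDEX_PATH = "default/index";

    // Reads properties text and copies the known key into the static field.
    static void load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Keep the current value when the key is absent.
        INDEX_PATH = props.getProperty("terrier.index.path", INDEX_PATH);
    }

    public static void main(String[] args) {
        // Step 1: the value comes from the properties text.
        load("terrier.index.path=/data/index\n");
        System.out.println(INDEX_PATH);
        // prints /data/index

        // Step 2: code overrides whatever was read from the file.
        INDEX_PATH = "/tmp/experiment/index";
        System.out.println(INDEX_PATH);
        // prints /tmp/experiment/index
    }
}
```

An override made in code only lasts for the current run, so settings that should persist across runs are better kept in the properties file.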
Some useful links:
- Terrier’s Homepage
- Terrier toolkit documentation
- Related Publications
- List of configurable properties in Terrier