Sunday, December 26, 2004

Lucene4c 0.01: Inching Towards Usefulness

When I picked up lucene4c again a week or two back, I told myself that I'd release the first version as soon as I had something that could print out all the documents in a segment that contain a particular term.

This means I need to be able to parse several files: the segments file, to figure out what the names of the segments are; the terminfos file and its index, to find out info about the term you asked for; the frequencies file, to get the list of documents that contain that term; and finally the field data file and its index, so we can print out, for each matching document, the field that holds the identifier for that doc.
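To give a flavor of what the frequencies-file step involves, here is a minimal sketch in C. It is not the lucene4c source: the names byte_cursor, read_vint, and read_term_docs are hypothetical, and the layout it assumes (variable-length integers, with each document stored as a gap from the previous one plus a low bit that flags a frequency of one) is based on my reading of the Lucene file format documentation. It works on bytes that have already been read into memory and ignores everything else in the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over bytes already read from a frequencies file. */
typedef struct {
  const unsigned char *buf;
  size_t len;
  size_t pos;
} byte_cursor;

/* Decode one variable-length integer: the low 7 bits of each byte are
 * data, least significant group first, and a set high bit means another
 * byte follows. Returns 0 on success, -1 if the input is truncated or
 * the value runs past five bytes. */
static int read_vint(byte_cursor *c, uint32_t *out)
{
  uint32_t value = 0;
  int shift = 0;

  assert(c != NULL && out != NULL);

  while (c->pos < c->len && shift < 35) {
    unsigned char b = c->buf[c->pos++];
    value |= (uint32_t)(b & 0x7f) << shift;
    if ((b & 0x80) == 0) {
      *out = value;
      return 0;
    }
    shift += 7;
  }
  return -1;
}

/* Decode doc_freq postings for one term (assumed layout, see above).
 * Each entry starts with a code: code >> 1 is the gap from the previous
 * document number; if the low bit is set the frequency is 1, otherwise
 * the frequency follows as its own variable-length integer.
 * Returns the number of postings decoded, or -1 on truncated input. */
static int read_term_docs(byte_cursor *c, uint32_t doc_freq,
                          uint32_t *docs, uint32_t *freqs)
{
  uint32_t doc = 0;
  uint32_t i;

  assert(docs != NULL && freqs != NULL);

  for (i = 0; i < doc_freq; i++) {
    uint32_t code, freq;

    if (read_vint(c, &code) != 0)
      return -1;
    doc += code >> 1;

    if (code & 1)
      freq = 1;
    else if (read_vint(c, &freq) != 0)
      return -1;

    docs[i] = doc;
    freqs[i] = freq;
  }
  return (int)doc_freq;
}
```

Under those assumptions, the three bytes 15, 8, 3 decode to two postings: document 7 with frequency 1, then document 11 with frequency 3. In a real lookup, the number of postings and the offset to start reading from would come from the term's entry in the terminfos file.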

I had sort of figured it would take me longer, but I made it to that milestone tonight, so the first release, version 0.01, is now available at http://electricjellyfish.net/garrett/lucene4c/lucene4c-0.01.tar.gz.

After building the source you should have a little program named 'lcn' in the src/cmdline directory that will allow you to play around with an index. Specifically, you can do something like this:

$ ./src/cmdline/lcn termdocs test/data/index _c7 jsp contents path
/Users/rooneg/Hacking/lucene4c/jakarta-lucene/src/CVS/Entries
/Users/rooneg/Hacking/lucene4c/jakarta-lucene/src/jsp/CVS/Repository
/Users/rooneg/Hacking/lucene4c/jakarta-lucene/src/jsp/results.jsp
/Users/rooneg/Hacking/lucene4c/jakarta-lucene/src/jsp/WEB-INF/CVS/Repository
$

This just searched segment '_c7' of the test index for the term 'jsp' in the field 'contents' and, for each result, printed out the field 'path'. You should also be able to use it on other Lucene indices, as long as they don't use the compound segment format, since I haven't even started work on the compound stuff yet.

If you're interested in the project, please feel free to download it and kick the tires. I'd appreciate any feedback you might have.