Sunday, December 19, 2004

Lucene4c Progress

So one of my "work on it when I'm bored with other things" projects is lucene4c, a port of the Lucene search engine from Java to C, using the Apache Portable Runtime to make writing in C more sane and portable.

Why would someone want to do such a thing? It seems like an awful lot of work, for not much gain, since the Java version of Lucene works damn well already.

Well, personally I want to be able to write an Apache module that lets you search over a Lucene index, and it's a lot simpler to do that in C than it is with Java. I'm also interested in the internals of search engines, and implementing the guts of one seems like a good way to learn. Plus, it's a neat way to kill some time on a lazy weekend at home.

Anyway, I had stopped working on it a few months ago because I got fed up with trying to track down a particular bug. This weekend I started playing around with it again, just for fun, and figured out the problem that had stopped me before: I was allocating a file in the wrong pool, so by the time I tried to read from it the memory had already been reused. With that obstacle out of the way I'm back in business, making slow but steady progress.

My first goal is to be able to read a Lucene index created by the Java version of Lucene. I figure a read-only version of lucene4c is at least sort of useful, and it's a lot easier than implementing the indexing side of things.

At this point I'm getting within sight of functionality. I can parse the segments file, the fieldinfos file, and the terminfos index file, although I need some tests for the tii stuff because it's likely that some of that isn't working correctly.

My next step is code to read the terminfos file (which shouldn't be that bad, since I've already got code to do much of that work from parsing the terminfos index file) and then the frequency file. With those two steps out of the way I'll in theory be able to answer questions like "what documents did the word lucene appear in?", which is nice since it would at least approximate some semblance of searching functionality.
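To give a rough idea of what reading the frequency file involves, here is a minimal sketch of decoding one term's entries, going by my reading of the fileformats documentation (which, as noted elsewhere in this post, may not match reality everywhere). Each entry starts with a VInt whose value divided by two is the gap from the previous document number; if that VInt is odd the frequency is one, otherwise the frequency follows as another VInt. The names posting_t and decode_term_freqs are made up for illustration and are not lucene4c's actual API, and the input is assumed to be VInts that have already been decoded into plain integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One posting: a document number and how often the term occurs in it.
 * (Hypothetical type, for illustration only.) */
typedef struct {
    uint32_t doc;
    uint32_t freq;
} posting_t;

/*
 * Decode up to max_out postings from an array of already-decoded VInt
 * values, following the documented layout of a term's frequency data.
 * Returns the number of postings written, or -1 if the input ends in
 * the middle of an entry.
 */
int decode_term_freqs(const uint32_t *vints, size_t n_vints,
                      posting_t *out, size_t max_out)
{
    size_t i = 0;
    size_t n = 0;
    uint32_t doc = 0; /* running document number; deltas start from 0 */

    assert(vints != NULL && out != NULL);

    while (i < n_vints && n < max_out) {
        uint32_t doc_delta = vints[i++];
        uint32_t freq;

        /* The upper bits carry the gap from the previous document. */
        doc += doc_delta >> 1;

        if (doc_delta & 1) {
            /* Odd value: the term occurs exactly once in this document. */
            freq = 1;
        } else {
            /* Even value: the frequency is stored as the next VInt. */
            if (i >= n_vints) {
                return -1;
            }
            freq = vints[i++];
        }

        out[n].doc = doc;
        out[n].freq = freq;
        n++;
    }
    return (int)n;
}
```

Fed the sequence 15, 8, 3, this yields two postings: document 7 with frequency 1, and document 11 with frequency 3.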

A side issue I've found along the way is that the fileformats documentation, while decent, is not entirely up to date. I'm making a concerted effort to note places where reality seems to differ from the documentation, and at some point I hope to send some patches upstream to fix the problems.

Instrumenting the low-level input stream classes has turned out to be my best way of figuring out exactly what's going on. It's one thing to trace through the Java code by hand, but actually having something say "every 10 terminfos we read this extra VInt" is a fantastic way to get pointed right to the spot where you misread the Java code.
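For the curious, the same kind of instrumentation can be sketched on the C side. This is not the lucene4c code, just an illustration under a few assumptions: the stream is a plain in-memory buffer, and the names in_stream_t and stream_read_vint are invented for the example. A VInt stores seven bits per byte, low-order bits first, with the high bit of each byte set when more bytes follow; the reader below decodes one and, if a trace file is set, logs where it started, how many bytes it took, and what it decoded to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A tiny in-memory input stream with an optional trace sink.
 * (Hypothetical type, for illustration only.) */
typedef struct {
    const unsigned char *buf; /* bytes to read from */
    size_t len;               /* total number of bytes in buf */
    size_t pos;               /* current read offset */
    FILE *trace;              /* where to log reads; NULL disables tracing */
} in_stream_t;

/*
 * Read one VInt: 7 data bits per byte, low-order group first, high bit
 * set on every byte except the last. Returns 0 on success, or -1 if the
 * buffer runs out or the value needs more than five bytes. On failure
 * the stream position is left where it was.
 */
int stream_read_vint(in_stream_t *s, uint32_t *value)
{
    size_t start;
    uint32_t result = 0;
    int shift = 0;

    assert(s != NULL && value != NULL);
    start = s->pos;

    for (;;) {
        unsigned char b;

        if (s->pos >= s->len || shift > 28) {
            s->pos = start; /* undo the partial read */
            return -1;
        }
        b = s->buf[s->pos++];
        result |= (uint32_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) {
            break; /* high bit clear: this was the last byte */
        }
        shift += 7;
    }

    if (s->trace != NULL) {
        fprintf(s->trace, "VInt at offset %lu (%lu bytes) = %lu\n",
                (unsigned long)start,
                (unsigned long)(s->pos - start),
                (unsigned long)result);
    }

    *value = result;
    return 0;
}
```

Pointing trace at stderr while parsing a file gives a running log of every VInt and its offset, which is the sort of output that makes an unexpected extra value stand out.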

Anyway, back to parsing the terminfos file...