Sunday, December 26, 2004

Lucene4c 0.01: Inching Towards Usefulness

When I picked up lucene4c again a week or two back I told myself that I'd release the first version as soon as I had something that could print out all the documents in a segment that contain a particular term.

This means I need to be able to parse the segments file to figure out what the names of the segments are, the terminfos file and its index to find out info about the term you asked for, the frequencies file to get the list of documents that contain that term, and finally the field data file and its index so we can print out, for each of those documents, the field that contains the identifier for the doc.

I had sort of figured it would take me longer, but I made it to that milestone tonight, so the first release, version 0.01, is now available at

After building the source you should have a little program named 'lcn' in the src/cmdline directory that will allow you to play around with an index. Specifically, you can do something like this:

$ ./src/cmdline/lcn termdocs test/data/index _c7 jsp contents path

This just searched the test index for the term 'jsp' in the field 'contents' and for each result printed out the field 'path'. You should also be able to use it on other Lucene indices, as long as they don't use the compound segment format, since I haven't even started work on the compound stuff yet.

If you're interested in the project please feel free to download it and kick the tires. I'd appreciate any feedback you might have.

A Great Xmas Present

I think of all the presents I got this year, my favorite is the Amazon book review Ben Collins-Sussman wrote for Practical Subversion.

When I originally started work on the book my intention was to take a different approach from the existing Subversion documentation, targeting a more experienced user as opposed to the newbie. I figured that since the existing documentation was already there filling the newbie role, there would be room for a book that didn't spend a ton of time teaching the basics of version control, and just jumped right to the stuff you need to know to really get going with Subversion.

According to Ben's review I appear to have succeeded.

I've seen a number of reviews of the book so far, and I've enjoyed reading all of them, but I think this one is my favorite, both because it tells me that I accomplished my original goal and because it's always great to hear that people you respect like your work.

Thanks Ben!

Wednesday, December 22, 2004

Giving to Open Source

Most of the time, when I "contribute" to an open source project it's in the form of a bug report, or a patch, or helping answer user questions on a mailing list or IRC. These are all important contributions, and I'm happy that I have the time and ability to do so, but that's not the kind of contribution I'm going to talk about today.

In this day and age it takes more than bug reports, source code, and user support to keep a project going. It takes money. Money for bandwidth, hardware, for (as much as we wish it wasn't so) lawyers, and for a hundred other things that are needed to keep the wheels going round. Below a certain size these sorts of things can be provided by the developers, or donated by a generous employer or user, but eventually you hit the point where need exceeds availability.

This doesn't even consider the fact that there are jobs in any large open source project that don't get done because the people who are able to do them can't afford to take the time off from their paying job to do them.

Fortunately, there are a number of organizations out there that are dedicated to providing financial support to these kinds of projects. Personally I have experience with three of them: the FreeBSD Foundation, the Perl Foundation, and the Apache Software Foundation. At various points over the past few years I've given money to each of these groups, and I expect I will continue to do so in the future.

What do I get in return for these contributions? Well, the FreeBSD Foundation funds development on my operating system of choice, allowing developers access to hardware and in some cases direct funding for work that would not otherwise get done. The Perl Foundation provides support for various projects related to the Perl programming language, including a wide variety of development grants that fund interesting new work. The ASF provides legal and technical infrastructure for a huge number of software projects.

I do admit, it would be nice if some of these groups were a bit more vocal about exactly what they're doing, but honestly I think they're doing a decent job at that, and in the end they are funding good work that needs to be done, so I'm happy to contribute, and if you're able to do the same I hope you do as well.

Sunday, December 19, 2004

Lucene4c Progress

So one of my "work on it when I'm bored with other things" projects is lucene4c, a port of the Lucene search engine from Java to C, using the Apache Portable Runtime to make writing in C more sane and portable.

Why would someone want to do such a thing? It seems like an awful lot of work, for not much gain, since the Java version of Lucene works damn well already.

Well, personally I want to be able to write an Apache module that lets you search over a Lucene index, and it's a lot simpler to do that in C than it is with Java. I'm also interested in the internals of search engines, and implementing the guts of one seems like a good way to learn. Plus, it's a neat way to kill some time on a lazy weekend at home.

Anyway, I had stopped working on it a few months ago because I got fed up with trying to track down a particular bug. This weekend I started playing around with it again, just for fun, and figured out the problem that had stopped me before: I was allocating a file in the wrong pool, so by the time I later tried to read from it the memory had already been reused. With that obstacle out of the way I'm back in business, making slow but steady progress.
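
To make that kind of bug concrete, here's a minimal sketch of how a pointer allocated from the wrong pool goes stale. The `pool_t` type and the `pool_*` functions are hypothetical stand-ins (a toy bump allocator), not APR's API and not lucene4c's code. The only point is to show what happens when something is allocated from a short-lived pool and used after that pool has been cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy bump allocator standing in for a memory pool (hypothetical, not APR).
 * It does no alignment handling, so it is only used for strings here. */
typedef struct {
    unsigned char mem[64];
    size_t used;
} pool_t;

void pool_init(pool_t *p) { p->used = 0; }

/* Clearing a pool makes all of its memory available for reuse. */
void pool_clear(pool_t *p) { p->used = 0; }

void *pool_alloc(pool_t *p, size_t n)
{
    void *ptr;
    assert(p->used + n <= sizeof p->mem);
    ptr = p->mem + p->used;
    p->used += n;
    return ptr;
}

char *pool_strdup(pool_t *p, const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = pool_alloc(p, n);
    memcpy(copy, s, n);
    return copy;
}

/* Returns 1 when a pointer kept across pool_clear() now sees someone
 * else's data, which is the failure mode described above. */
int stale_pointer_sees_new_data(void)
{
    pool_t scratch;
    char *name;

    pool_init(&scratch);
    name = pool_strdup(&scratch, "_c7");   /* allocated in the wrong pool */
    pool_clear(&scratch);                  /* pool's owner is done with it */
    (void)pool_strdup(&scratch, "xyz");    /* memory gets reused */
    return strcmp(name, "xyz") == 0;       /* name no longer reads "_c7" */
}
```

The fix in a setup like this is to allocate the object from a pool that lives at least as long as everything that reads from it.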

My first goal is to be able to read a Lucene index created by the Java version of Lucene. I figure a read-only version of lucene4c is at least sort of useful, and it's a lot easier than implementing the indexing side of things.

At this point I'm getting within sight of functionality. I can parse the segments file, the fieldinfos file, and the terminfos index file, although I need some tests for the tii stuff because it's likely that some of that isn't working correctly.

My next step is code to read the terminfos file (which shouldn't be that bad, since I've already got code to do much of that work from parsing the terminfos index file) and then the frequency file. With those two steps out of the way I'll in theory be able to answer questions like "what documents did the word lucene appear in", which is nice since it would at least approximate some semblance of searching functionality.
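
As a rough illustration of the frequency file step, here's a sketch of turning one term's entries into a list of (document, frequency) pairs. It assumes the layout described in the Lucene file formats documentation of this era: each entry starts with a DocDelta value whose upper bits are the gap from the previous document number and whose low bit, when set, means the frequency is one; otherwise the frequency follows as a separate value. The function and type names are made up for the example, it works on values that have already been decoded into integers, and it isn't lucene4c's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One (document number, term frequency) pair. Hypothetical type. */
typedef struct {
    uint32_t doc;
    uint32_t freq;
} posting_t;

/* Decode a term's frequency entries from already-decoded integer values.
 * Assumed layout: DocDelta, then Freq only when DocDelta is even.
 * Returns the number of postings written, or -1 if the input ends
 * in the middle of an entry. */
int decode_term_freqs(const uint32_t *vals, size_t n,
                      posting_t *out, size_t max_out)
{
    size_t i = 0, count = 0;
    uint32_t doc = 0;

    assert(vals != NULL && out != NULL);

    while (i < n && count < max_out) {
        uint32_t delta = vals[i++];

        doc += delta >> 1;              /* gap from the previous doc number */
        if (delta & 1) {
            out[count].freq = 1;        /* odd: frequency is implicitly one */
        } else {
            if (i >= n)
                return -1;              /* missing the explicit frequency */
            out[count].freq = vals[i++];
        }
        out[count].doc = doc;
        count++;
    }
    return (int)count;
}
```

Under that assumed layout, the values 15, 8, 3 decode to document 7 with frequency 1 followed by document 11 with frequency 3.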

A side issue I've found along the way is that the fileformats documentation, while decent, is not entirely up to date. I'm making a concerted effort to note places where reality seems to differ from the documentation, and at some point I hope to send some patches upstream to fix the problems.

Instrumenting the low-level input stream classes has turned out to be my best way of figuring out exactly what's going on. It's one thing to trace through the Java code by hand, but actually having something say "every 10 terminfos we read this extra VInt" is a fantastic way to get pointed right to the spot where you misread the Java code.
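
For a sense of what that instrumentation looks like on the C side, here's a small sketch of a VInt reader with an optional trace. A VInt stores seven bits of the value per byte, low-order bits first, with the high bit of each byte set when another byte follows. The `istream_t` type and its `trace` flag are assumptions made up for this example, not the real stream types from Lucene or lucene4c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory input stream with a tracing switch. */
typedef struct {
    const unsigned char *buf;
    size_t len;
    size_t pos;
    int trace;      /* nonzero: log every VInt read to stderr */
} istream_t;

/* Read one VInt: 7 payload bits per byte, low-order group first,
 * high bit set means another byte follows.
 * Returns 0 on success, -1 on truncated input or a value that runs
 * past five bytes. On failure the stream position is left wherever
 * reading stopped. */
int read_vint(istream_t *in, uint32_t *out)
{
    size_t start;
    uint32_t value = 0;
    int shift = 0;

    assert(in != NULL && out != NULL);
    start = in->pos;

    for (;;) {
        unsigned char b;

        if (in->pos >= in->len || shift > 28)
            return -1;
        b = in->buf[in->pos++];
        value |= (uint32_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            break;
        shift += 7;
    }

    if (in->trace)
        fprintf(stderr, "VInt at offset %lu: value %lu (%lu bytes)\n",
                (unsigned long)start, (unsigned long)value,
                (unsigned long)(in->pos - start));

    *out = value;
    return 0;
}
```

With tracing switched on, an unexpected extra value in a file shows up as an extra line in the log at a known offset, which is much easier to spot than a count that is off by one somewhere downstream.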

Anyway, back to parsing the terminfos file...

Sunday, December 12, 2004

PDF Books

I'm really starting to like the recent trend towards publishers making books available in both PDF and Dead Tree form. It's a refreshing change of pace when compared to the regular DRM-laden crap we get from the entertainment industry, and I'm glad that at least the technical book publishers appear to be a reasonable bunch with regard to such things.

Specifically, I've recently bought both Programming Ruby and Lucene In Action in both PDF and Dead Tree form, and I've been quite pleased with the results.

The way it's worked is simple: you order the book from the publisher like you normally would, and for either slightly more or the same amount you would normally spend in a bookstore you get the hard copy shipped to you and a personalized PDF (with some kind of "this pdf prepared for <your name here>" thing on each page) prepared so you can download it immediately.

For technical books this is fantastic. If I order a tech book online I'm probably interested in it right now, so the PDF means I can get some instant gratification, which is a big plus. Then, later on, I get the dead tree version which, let's face it, is much nicer to read. In the long run though, I'll always have the PDF on my laptop so I can look something up if/when I don't happen to have the hard copy on me.

As for the DRM issue, I like to think that the personalization of the PDF is a nice way to keep people from passing them around on file sharing systems, although it's possible that's just me being overly naive. Time will tell I suppose.

In any event, I'm pleased with the few books I've purchased in this manner so far, and I plan on doing so again in the future if I'm given the opportunity.

Now back to reading about Lucene...

Sunday, December 5, 2004

rooneg at

It took a little while for the paperwork to go through, but today my account was set up, and I committed my first changes to the APR.

It's kind of cool being a part of something like that. I've thought the project was an important one for some time, and I'm happy to have the chance to contribute directly to it. Now I'm just hoping to not let down the people who decided I was worth giving commit access to. I figure the first "oh my $DEITY, you committed what!?!?! there are so many things wrong with that commit that I can't even begin to count them" email should show up any time now...

Plus, I must admit, the email address is pretty cool ;-)

Wednesday, December 1, 2004

Dependencies Suck

You know, I think it's a bit absurd that I had to install 15 separate Perl modules in order to convince the Perl Email Project to successfully send an email with both HTML and text versions of a message.

I mean I'm all for modularity, but 15 fucking modules! And this is for a project that has as a stated goal to be "minimal in their external dependencies".

I'd hate to see what they would do if they totally pulled out all the stops and depended on whatever the fuck they wanted...