so i spent much of this week tracking down a really irritating bug, and since it taught me an important lesson, i figured i'd share.
the bug manifested itself as one of my servers eventually, under very heavy load, ceasing to reply to requests. when we broke into the process in the debugger, everything seemed normal, and the server's internal count of the number of worker threads it had running said that it was totally maxed out, with like 300 worker threads working away, but actually listing the threads in dbx showed no worker threads at all!
so i started out by sticking debugging output all over the life cycle of the worker threads, anywhere that they could possibly be exiting when they really shouldn't, but for days, the server would run with none of those outputs showing up, and yet the problem would still occur eventually. my worker threads were dying, and i couldn't see why!
finally, i started reading some more docs about the debugger, and i discovered that there was an extra argument you could pass to the debugger that would cause it to list all the threads, not just the active ones. when i got a chance to see the problem in question occurring, i tried this, and it showed tons of zombie threads sitting around. so i started reading up on what the hell a zombie thread is...
around this time, one of my coworkers was hanging around my cube, talking about this problem and others that we were trying to track down, and he said something that pointed me right at what the problem was.
it turns out that zombie threads are the leftover parts of threads that stick around after they exit, so that you can later do a pthread_join on the thread to figure out its exit status. we were not actually doing this when our workers eventually exited, as the exit value of our threads was inconsequential to the rest of the application, so having the zombies wait around was kind of silly. worse yet, having them wait around was causing pthread_create to actually fail later on because it was out of resources, and i wasn't picking up on this because i was a moron and didn't check the return value from that function. so i'd go along blissfully unaware that i had failed to create a thread, increment our internal counter saying 'yes, we have one more worker now', when in truth we did not. thus the weirdness where we end up with no threads but think we are maxed out.
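for the record, here's roughly what the check i skipped looks like. this is a minimal sketch, not our actual server code: worker_main, worker_count and spawn_worker are made-up names for illustration. the one thing to remember is that pthread_create returns an error number directly (it doesn't set errno), so you test the return value against 0 and only bump the counter on success.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* hypothetical counter of live workers; a real server would guard it with a mutex */
static int worker_count = 0;

/* hypothetical worker body; the real one would service requests */
static void *worker_main(void *arg)
{
    (void)arg;
    return NULL;
}

/*
 * try to start one worker. returns 0 on success, or the error number
 * from pthread_create on failure. the counter is only incremented
 * when a thread really was created.
 */
static int spawn_worker(pthread_t *tid)
{
    int rc = pthread_create(tid, NULL, worker_main, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        return rc;
    }
    worker_count++;
    return 0;
}
```

the caller still has to either join or detach the thread it gets back in tid, otherwise the zombies pile up just like before.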
to solve the problem, you either need to join with the threads after they die, so that the thread library can stop keeping that data around and reuse it, or you can just detach the threads, either by calling pthread_detach or by setting the detach state in the thread attributes you pass to pthread_create (pthread_attr_setdetachstate with PTHREAD_CREATE_DETACHED).
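here's a rough sketch of the two ways to detach, assuming you never need the worker's exit value. the function names and worker_main are invented for the example; the pthread calls themselves are the standard posix ones.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* hypothetical worker body */
static void *worker_main(void *arg)
{
    (void)arg;
    return NULL;
}

/* option 1: create the thread already detached, via an attribute object */
static int spawn_detached_worker(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, worker_main, NULL);

    /* the attribute object is no longer needed once the thread exists */
    pthread_attr_destroy(&attr);

    if (rc != 0)
        fprintf(stderr, "could not start detached worker: %s\n", strerror(rc));
    return rc;
}

/* option 2: create a normal (joinable) thread, then detach it */
static int spawn_then_detach(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker_main, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        return rc;
    }
    return pthread_detach(tid);
}
```

either way, once a thread is detached its resources get released when it exits, and you can't pthread_join it afterwards.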
once again, let's be clear that if i were less of a moron and had remembered to check the return value from pthread_create, this week-long debugging saga would have been considerably shorter.
there's one more mistake i won't make again...