November 1st, 2001

Problem summary

Two of the database servers weren't accepting new connections last night. Oddly, the machines weren't near swapping, had only half their normal connections, and had no load. So something's weird.

We're working with the database vendor to figure out why it happened. We're probably hitting some artificial limit we never raised. On that assumption, I've restarted the two offending database servers with lower resource-usage parameters to keep them under the suspected limit until we can figure out why it happened in the first place.
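We don't know yet which limit it is, but for the curious, here's roughly where I'd look first on a Linux box (a sketch, not the actual diagnosis — the numbers on our machines may differ):

```shell
# Kernel-wide cap on open files; a database with lots of
# connections can bump into this even when load is low.
cat /proc/sys/fs/file-max

# Per-process file descriptor limit for the current shell;
# each client connection typically costs at least one descriptor.
ulimit -n
```

If either number turns out to be the culprit, raising it (via sysctl or the shell's ulimit before starting the database) is the easy part; knowing which one to raise is why we're talking to the vendor.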

This wasn't a problem with bml, as so many people decided. The only reason it was more obvious there is that we run fewer of those processes and they recycle themselves more often, which means they try to reconnect to the database more often. Normally, if the database is gone, they complain immediately. But the database was accepting connections and then hanging, so the bml processes hung too. All the web processes hung, actually, but some lived through the night, and those were the ones you were hitting... the ones that were still responsive. In another few hours, everything would've been unresponsive.

In any case, we're going to figure out why this happened in the first place, then document it so when we build future database servers it won't happen again.

Actual maintenance!

We've been converting machine after machine to ext3[1] so if another power failure hits, it doesn't cause so much chaos.

We'd been talking about doing this "sometime soon" for weeks before the power outage. Guess we learned our lesson.

Most machines we can just reboot, and the load balancer won't give them traffic while they're dead. However, now it's time to do the master database server. The site will be down for at most 5 minutes sometime in the next hour. (Waiting for sherm to call me from the NOC ... he's going to be there just in case there's a problem with the machine not coming back up.)

[1] I like ReiserFS also, but converting an existing machine from ext2 to ext3 is just so painless: only one command.
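For the record, that one command is just tune2fs adding a journal to the existing filesystem (the device name below is a placeholder, not one of our actual disks):

```shell
# Add a journal to an existing ext2 filesystem, making it ext3.
# Data stays in place; no reformat, no copy.
tune2fs -j /dev/sda1

# Then change the filesystem type from ext2 to ext3 in /etc/fstab
# and remount (or reboot) so the kernel mounts it with journaling.
```

Converting to ReiserFS, by contrast, means making a new filesystem and copying everything over, which is exactly the kind of chaos we're trying to avoid.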