Brad Fitzpatrick (bradfitz) wrote in lj_maintenance,

Problem summary

Two of the database servers weren't accepting new connections last night. Oddly, the machines weren't near swapping, had only half their normal connections, and had no load. So something's weird.

We're working with the database vendor to figure out why it happened. We're probably hitting some artificial limit we never raised. On that assumption, I've restarted the two offending database servers with lower resource usage parameters as a stopgap until we know the real cause.

This wasn't a problem with bml, as so many people decided. It was just more obvious there because we run fewer of those processes and they recycle themselves more often, which means they try to reconnect to the database more often. Normally, if the database is gone, they complain immediately. But the database was accepting connections and then hanging, so the bml processes hung too. All the web processes hung, actually, but some lived through the night, and those were the ones you were hitting: the ones that were still responsive. In another few hours, everything would've been unresponsive.
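
For the curious, here's a rough sketch of the difference between "the database is gone" and "the database accepts the connection and then hangs". This is not our actual code; the probe function, its return labels, and the timeout value are all made up for illustration, and it assumes a server that sends a greeting as soon as a client connects. The point is that a client without a read timeout would sit in the second case forever, which is roughly what the web processes did.

```python
import socket


def probe(host, port, timeout=2.0):
    """Hypothetical health probe, for illustration only.

    Returns one of:
      "unreachable" - the connection itself failed (refused or timed out)
      "hung"        - connected, but the server sent nothing within `timeout`
      "closed"      - connected, but the server closed without sending anything
      "responsive"  - connected and the server sent at least one byte
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Nothing listening, or the connect attempt timed out.
        return "unreachable"

    with conn:
        # Without this timeout, recv() would block indefinitely on a
        # server that accepts the connection and then goes silent.
        conn.settimeout(timeout)
        try:
            data = conn.recv(1)
        except socket.timeout:
            return "hung"
        except OSError:
            return "unreachable"
        return "responsive" if data else "closed"
```

A process that only checks whether the connect succeeded would treat "hung" the same as "responsive", so it never gets the chance to complain.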

In any case, we're going to figure out why this happened in the first place, then document it so when we build future database servers it won't happen again.