What happened? A power failure, apparently.
This surprised us. Internap, our host, has redundancy everywhere: multiple network connections, power grids, backup generators... even the building has some kind of certified protection against earthquakes. Our servers were connected to both power grids, so we didn't anticipate a power loss. Internap is really a great company, and we've had no problems with them for the past year.
There's one weakness in Internap's system, though, and it was discovered today. In case of a fire, they need to be able to cut all of the power running into the building, so there's no possibility of firefighters being electrocuted.
There's a big red button in a glass box in some room somewhere that does this. Somebody took a visitor near that box. The visitor mistook the button for the one that releases the door...
(The button is now labelled.)
The site took a long time to come back because bradfitz, our dedicated leader, ran a full integrity check on the database. A database is split into two separate files: the actual data, and the indexes into the data.
The indexes were corrupted, but the data is fine. Do not worry. :)
In the worst possible case, we can recover lost data from a backup, but any weirdness you may encounter is more likely related to the broken indexes.
How can we handle this in the future?
- We're buying an Uninterruptible Power Supply, so our machines can handle power failures gracefully. We thought one wouldn't be necessary, but it seems best to stay on the safe side.
- (Technical digression:) A couple of file systems needed fsck'ing. We may move to ext3 in the future.
- A few of the machines weren't configuring themselves properly when they booted. dormando has been fixing this.
- Once all of this is in place, we can test it by unplugging our machines from the power.
Thanks for your patience. For once, it's not our fault! :)