Evan Martin (evan) wrote in lj_maintenance,

As some of you may have noticed, the site was down for much of today.

What happened? A power failure, apparently.

This surprised us. Internap, our host, has redundancy everywhere: multiple network connections, power grids, backup generators... even the building has some certifiable safety against earthquakes. Our servers were connected to both power grids, so we didn't anticipate a power loss. Internap is really a great company, and we've had no problems with them for the past year.

There's one weakness in Internap's system, though, and that was discovered today. In the case of a fire, they need to be able to turn off all of the power running into the building completely so there's no possibility of firefighters being electrocuted.

There's a big red button in a glass box in some room somewhere that does this. Somebody took a visitor near that box. The visitor mistook it for the button that releases the door...

...pow.

(The button is now labelled.)


The site took a long time to come back because bradfitz, our dedicated leader, ran a full integrity check on the database. A database is split into two separate files: the actual data, and the indexes into the data.
The indexes were corrupted, but the data is fine. Do not worry. :)
In the worst possible case, we can recover lost data from a backup, but any weirdness you may encounter is more likely related to the broken indexes.
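
(Technical digression:) For the curious, here is a toy sketch of why a broken index is much less scary than broken data. It is only an illustration, not how our database actually works: the `records` list, `build_index`, and `lookup` are made-up names, and a real database keeps these structures in files on disk. The point is just that an index is derived from the data, so it can be thrown away and rebuilt by scanning the data again.

```python
# Toy model: the "data file" is a list of (key, value) records, and the
# "index file" maps each key to the position of its record.
# All names here are hypothetical, for illustration only.

def build_index(records):
    """Scan every record and map its key to its position in the data."""
    return {key: pos for pos, (key, _value) in enumerate(records)}


def lookup(records, index, key):
    """Use the index to jump straight to a record instead of scanning."""
    return records[index[key]][1]


records = [
    ("alice", "first post"),
    ("bob", "hello"),
    ("carol", "test entry"),
]
index = build_index(records)

# Simulate index corruption: bob's index entry now points at alice's record.
# The data itself is untouched, but lookups through this index go wrong.
index["bob"] = 0

# Recovery: discard the damaged index and rebuild it from the intact data.
rebuilt = build_index(records)
```

Looking up `"bob"` through the damaged index returns the wrong entry, while the same lookup through `rebuilt` finds the right one again. That is roughly the kind of weirdness a broken index can cause, and why rebuilding it fixes things without touching a backup.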


How can we handle this in the future?
- We're buying an Uninterruptible Power Supply, so our machines can handle power failures gracefully. We thought one wouldn't be necessary, but it seems best to stay on the safe side.
- (Technical digression:) A couple file systems needed fsck'ing. We may move to ext3 in the future.
- A few of the machines weren't configuring themselves properly when they booted. dormando has been fixing this.
- Once all of this is in place, we can test it by unplugging our machines from the power.

Thanks for your patience. For once, it's not our fault! :)