Evan Martin (evan) wrote in lj_maintenance,

cluster 1 outage

Some unexpected problems forced us to take down the Cartman cluster (where are you?). We had a choice between:
  1. repairing it relatively quickly, but making it look like the journals were temporarily missing during the repair, or
  2. repairing more slowly while the site continued to run in read-only mode.
We chose the latter option. As usual, no data has been lost, and things appear to be working smoothly now.

We’re really sorry, and rest assured this frustrates us as much as it frustrates you. (I’ve been waiting to post an entry all day...) Every failure is different, so we’ve learned a lot from this one, and LiveJournal will be stronger as a result.

Geek Version
The Cartman cluster master was running an older Linux kernel that had a bug in ext3, so the file system ate itself after an uptime of almost a year. Last night, before the problem was discovered, we rebuilt the indexes, but that wasn’t enough to save things this morning. Rather than take the whole cluster down, we let the slave serve the cluster in read-only mode while the master (slowly) rsynced the data off of the slave.
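
For the curious, here is a rough sketch of what "read-only mode" means for a cluster whose master is down. It is an illustration only and not the actual LiveJournal code: the `Cluster` class, the `ClusterReadOnly` exception, and the in-memory dictionaries standing in for the master and slave databases are all hypothetical, and replication is modeled as instantaneous.

```python
class ClusterReadOnly(Exception):
    """Raised when a write is attempted while the cluster master is down."""


class Cluster:
    """Toy model of one master/slave cluster (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.master_up = True
        self.master_data = {}  # stand-in for the master database
        self.slave_data = {}   # stand-in for the slave's replicated copy

    def read(self, key):
        # Reads use the master when it is up and fall back to the slave otherwise.
        source = self.master_data if self.master_up else self.slave_data
        return source.get(key)

    def write(self, key, value):
        # Writes need the master; with the master down the cluster is read-only.
        if not self.master_up:
            raise ClusterReadOnly(f"cluster {self.name} is in read-only mode")
        self.master_data[key] = value
        self.slave_data[key] = value  # replication, assumed instantaneous here


cartman = Cluster("cartman")
cartman.write("entry:1", "hello")
cartman.master_up = False            # simulate the master failure

existing = cartman.read("entry:1")   # still served, now from the slave
try:
    cartman.write("entry:2", "new post")
    write_rejected = False
except ClusterReadOnly:
    write_rejected = True
```

The point of the design is that existing journals stay visible from the slave while new posts are refused until the master is back.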

Bad experiences with upgrades in the past (if it’s not broke...) had kept us from upgrading this machine, but newer kernels have been working for us elsewhere, so we’ll be sure to upgrade the remaining machines.

The Bright Side
We found three places in the code that didn’t tolerate a cluster master being down, and fixed them. Our inability to post to lj_maintenance during the outage also got us thinking about a scheme where people’s journals are stored on multiple clusters, so that a single critical machine failure wouldn’t force a large fraction of our userbase into read-only mode.
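
The multi-cluster idea is only a thought at this stage, but a minimal sketch shows why it would help. Everything here is an assumption made for illustration: the `placement` mapping, the user names, the second cluster name, and both helper functions are hypothetical, and the sketch ignores how the copies would be resynchronized after a failure.

```python
def writable_clusters(user_clusters, down):
    """Return the clusters holding this user's journal that are still up."""
    return [c for c in user_clusters if c not in down]


def read_only_users(placement, down):
    """Users with no writable cluster left, i.e. those stuck in read-only mode."""
    return sorted(
        user
        for user, clusters in placement.items()
        if not writable_clusters(clusters, down)
    )


# Hypothetical placement: alice's journal lives on two clusters, bob's on one.
placement = {
    "alice": ["cartman", "kenny"],
    "bob": ["cartman"],
}

# With the Cartman master down, only the single-cluster user is affected.
affected = read_only_users(placement, down={"cartman"})
```

In this toy example `affected` contains only `bob`: a user stored on a second cluster can keep posting when one cluster's master fails.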
