Some unexpected problems forced us to take down the Cartman cluster (where are you?). We had a choice between:
- repairing it relatively quickly, but making it look like the journals were temporarily missing during the repair, or
- repairing more slowly while the site continued to run in read-only mode.
We’re really sorry, and rest assured this frustrates us as much as it does you. (I’ve been waiting to post an entry all day...) Every failure is different, so we’ve learned a lot from this one, and LiveJournal will be stronger as a result.
The Cartman cluster master was running an older Linux kernel with a bug in ext3, and the file system ate itself after an uptime of almost a year. Last night, before the problem was discovered, we rebuilt the indexes, but that wasn’t enough to save things this morning. We chose the slower option: rather than take the whole cluster down, we let the slave run the cluster in read-only mode while the master (slowly) rsynced the data from the slave.
Bad experiences with upgrades in the past (if it’s not broke...) meant that we hadn’t upgraded this one, but newer kernels have been working for us elsewhere, so we’ll be sure to upgrade the rest.
The Bright Side
We found three places in the code that weren’t tolerant of cluster masters being down and fixed them. Our inability to post to lj_maintenance during the outage led us to start thinking about a scheme where people’s journals are actually stored on multiple clusters, so that a single critical machine failure wouldn’t force a large fraction of our userbase into read-only mode.
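For the curious, here is a rough sketch of the multi-cluster idea. It is only an illustration of the thinking, not our actual code: the `Cluster` class, both function names, and the second cluster name ("kenny") are made up for the example. The assumption is that each journal lists several clusters and a write goes to the first one whose master is up.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    master_up: bool = True  # hypothetical health flag for the cluster master


def pick_writable_cluster(user_clusters, clusters):
    """Return the first of the user's clusters whose master is up, else None."""
    for name in user_clusters:
        cluster = clusters.get(name)
        if cluster is not None and cluster.master_up:
            return cluster.name
    return None


def access_mode(user_clusters, clusters):
    """'read-write' if any assigned cluster can take writes, else 'read-only'."""
    if pick_writable_cluster(user_clusters, clusters) is not None:
        return "read-write"
    return "read-only"


# Hypothetical layout: one cluster master is down, as in this outage.
clusters = {
    "cartman": Cluster("cartman", master_up=False),
    "kenny": Cluster("kenny"),
}

single_home = access_mode(["cartman"], clusters)          # journal on one cluster
multi_home = access_mode(["cartman", "kenny"], clusters)  # journal on two clusters
```

With a journal stored only on the failed cluster, the sketch falls back to read-only. With a second cluster listed, writes can go there instead. Keeping the copies in sync is the hard part, and the sketch does not attempt it.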