Brad Fitzpatrick (bradfitz) wrote in lj_maintenance,

Ribeye Troubles

If you're missing any posts/comments since the ribeye move, read on...

I'm still working on a few cases where people lost posts/comments during the ribeye switch.

The root of the problem is still that config file typo and the subsequent "fix" without restarting the webservers at the right time. There was a small window where posts/comments/etc. were going to one of two master databases, randomly. (There should only be one master database at a time.)

In any case, not everything will be fixed. Recent comments/posts probably won't all be restored. But in cases where a new post that was assigned postid #1 overwrote the real, months- or years-old postid #1, I'll definitely fix it, restoring from the pristine copy on the old cluster.

It's really unfortunate I made that typo in the first place because it makes it seem like the cluster moving system isn't robust. On the contrary, we've moved hundreds of thousands of people without problems, both in this last round and in months past.

Restoring the data from this sort of screw-up is really hard: things are spewed onto both databases, often with overlapping ID numbers. And every case is hard enough to require a tool to be written to fix it, but also too unique to let that tool be used for anybody else's problem. It's been eating all my time lately. :-/
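To illustrate what "overlapping ID numbers" means in practice, here is a minimal sketch in Python. It is not the actual restore tooling; the function name and the idea of representing each master's rows as a dictionary keyed by ID are assumptions made purely for illustration.

```python
def classify_ids(old_rows, new_rows):
    """Compare rows found on two masters (hypothetical dicts of id -> content).

    Returns three sorted lists of IDs:
      - IDs present only on the old master
      - IDs present only on the new master
      - IDs present on both masters with different content (the hard cases)
    IDs present on both with identical content appear in none of the lists.
    """
    old_ids = set(old_rows)
    new_ids = set(new_rows)
    only_old = sorted(old_ids - new_ids)
    only_new = sorted(new_ids - old_ids)
    conflicts = sorted(
        i for i in old_ids & new_ids if old_rows[i] != new_rows[i]
    )
    return only_old, only_new, conflicts
```

The first two lists are the easy part: those rows can simply be copied across. The third list is where each case tends to need its own handling, because two different posts or comments are claiming the same ID.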

The real fix is to make sure this never happens again, and I will. Before we start the next move, I'll formalize our cluster vs. subcluster system so that we no longer need tons of disgusting configuration, which is where the error occurred. I'll also make a system to check our configuration for problems like this.
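As a rough idea of what such a configuration check could look like, here is a small Python sketch. It is only an illustration under assumed names: the `db_roles` mapping (host name to a cluster name and role) and the function name are hypothetical, not the real config format.

```python
from collections import defaultdict


def find_master_conflicts(db_roles):
    """Return clusters that do not have exactly one master.

    db_roles is a hypothetical mapping: host -> (cluster, role),
    where role is "master" or "slave". The result maps each bad
    cluster to the sorted list of hosts configured as its master
    (an empty list means the cluster has no master at all).
    """
    masters = defaultdict(list)
    clusters = set()
    for host, (cluster, role) in db_roles.items():
        clusters.add(cluster)
        if role == "master":
            masters[cluster].append(host)
    return {
        cluster: sorted(masters[cluster])
        for cluster in sorted(clusters)
        if len(masters[cluster]) != 1
    }
```

Running a check like this before any webserver picks up a new config would flag the "two masters at once" situation described above, instead of letting writes land on either database at random.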

But before I move on to that, I'll be working on restoring data for people who lost really old posts/comments.

I think I'll tell the support crew to refer people to this post. If you lost something important, please file a new support request with "[dataloss]" in the subject, and include both the details and a note saying you've read this post. I just don't have time to restore unimportant posts/comments that could just as easily be posted again. But if it's something old/important, I do need to fix it.

Thanks for reading, and sorry for the troubles.