July 10th, 2004

chef-8 status update

Lisa and I are beyond tired, so:

Brief update:
We're still working on repairing lost data from the cluster 49 to 28 move.

It's not going to be finished tonight, though. It'd be dangerous for us to continue while we're this delirious anyway.

Details, for those inclined:
We have the pristine 49, the corrupt 28, and all the pristine changes made to 28 since 49 became 28. (28 became corrupt due to an unflushed page writeout that occurred after the 49 files went onto 28.)

We could've rolled back to the old 49 at any point, but then people would've lost hours/days of entries/comments they made after the move.

Instead, we've cloned pristine 49 and are applying all the pristine changes from 28 against the new modified 49, and that'll become the new 28 when we're done.
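For the curious, here's a rough sketch of that clone-and-replay idea. It's purely illustrative and not our actual tooling: the dict-based "cluster", the `rebuild_cluster` function, and the `(op, key, value)` change format are all made up for the example.

```python
import copy


def rebuild_cluster(pristine_snapshot, change_log):
    """Copy the pristine snapshot, then apply each logged change in order.

    Hypothetical model: a cluster is a dict, and each change is an
    (op, key, value) tuple where op is "put" or "delete".
    """
    # Work on a clone so the original snapshot stays untouched.
    rebuilt = copy.deepcopy(pristine_snapshot)
    for op, key, value in change_log:
        if op == "put":
            rebuilt[key] = value
        elif op == "delete":
            rebuilt.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return rebuilt


# Stand-ins for the pristine 49 and the good changes made on 28 after the move.
old_49 = {"entry:1": "hello", "entry:2": "draft"}
changes_since_move = [
    ("put", "entry:3", "posted after the move"),
    ("put", "entry:2", "edited after the move"),
    ("delete", "entry:1", None),
]

new_28 = rebuild_cluster(old_49, changes_since_move)
```

Working on a copy means the pristine snapshot is still there to fall back on if the replay goes wrong.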

Note: we've done these sorts of moves so many times. We weren't doing anything fancy. The corruption happened because a step was missed that isn't generally necessary, but was this time. We've updated our sanity-checklist to include that step. We're also working on automating it all, so there's never the human factor messing things up.

Anyway, once this is fixed, we'll then have the tools and knowledge to fix similar issues in the future, if need be. (Hopefully not)
  • lisa

Chef

Time of post: 12:35 PST (GMT -07)

We are preparing to apply the final fix to Chef cluster 8 (where am I?). For the final step, we will need to take Chef completely down for about 20 minutes. We will be doing that in 30 minutes to 1 hour from now. To Chef users it may appear journals are missing or read-only.

After it is back up, there may be some bad data in memcaches so some lingering problems may exist while we clear them, but they will go away.

No data will be lost.

Thank you!
sock monkey
  • lisa

Chef update

OK, we have finally completed the work to restore the data for users on what is now Chef cluster 8 (Where am I?). If any of you on this cluster experience any additional issues with incorrect or missing data, please open a support request with CKQHFPRH in the body. We will continue to monitor this and update as necessary.

UPDATE 15:59 PST: Your entries and comments are not deleted. There is a bug, which we are working to correct now, that is preventing you from viewing them. Thank you.
potato

Restoring Entries

For those of you who are on Chef, subcluster 8 (Where am I?) and have noticed missing entries (not comments), we've developed a tool to let you restore those:

http://www.livejournal.com/misc/entry-restore.bml

Note that the date will be wrong and the security will be set to private. You'll have to fix those by hand.

If this tool works for you, let us know in the comments. If it doesn't work for you, let us know too, and also file a support request in our support center.

Our restore this afternoon fixed most everything, but a few things went into the database corrupted because of the previous corruption, and replaying them onto the clean database just brought along the corruption.

Also, off-topic comments to this post will be deleted, since we really want to hear from affected people, and not "first post!" or "great userpic!"

We really appreciate everybody's patience and support through this. We're very sorry this has turned out to be such a pain. If it's any consolation, we're all the wiser now.
computer crap

Chef subcluster 8 paid time reimbursement

We'd like to apologize to all users of Chef subcluster 8 (Where am I?) who were impacted by the last few days of problems. We consider your data and privacy the most important things to maintain. It's very rare for us to experience data loss and the last few days have been particularly depressing considering how much time we've spent working on trying to restore it all.

As we explained before, there was some data corruption that occurred during a seemingly-routine user move. The most obvious impact to users was the loss of personalized style information and some user preferences. Despite our best efforts we were only able to restore most of the lost data. While there's a possibility we may be able to recover even more, we can't promise anything at this point.

Once again, we'd like to apologize to users on Chef subcluster 8. For those users on this cluster with paid accounts, we would also like to offer you a one month extension of paid account time. If you have been affected by these problems, you can claim a free month of paid time here:

http://www.livejournal.com/misc/claim-cluster28.bml

The claim tool will work for the next week. If you have friends who were affected and don't read this journal, let them know. If your account has expired in the last 2 days, you can buy a new one, then claim your extension when your paid account is active.

We're working hard to prevent these issues in the future and we've definitely learned a lot from this incident. Thank you for your understanding.