Brad Fitzpatrick (bradfitz) wrote in lj_maintenance,

chef-8 status update

Lisa and I are beyond tired, so:

Brief update:
We're still working on repairing lost data from the cluster 49 to 28 move.

It's not going to be finished tonight, though. It'd be dangerous for us to continue this delirious anyway.

Details, for those inclined:
We have the pristine 49, the corrupt 28, and all the pristine changes made to 28 since 49 became 28. (28 became corrupt due to unflushed page writeout that occurred after the 49 files went onto 28.)

We could've rolled back to the old 49 at any point, but then people would've lost hours/days of entries/comments they made after the move.

Instead, we've cloned pristine 49 and are applying all the pristine changes from 28 against the new modified 49, and that'll become the new 28 when we're done.
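
As a rough illustration of that clone-and-replay idea, here is a minimal sketch in Python. It is not the actual tooling: the dictionary standing in for a cluster, the `(sequence, operation, key, value)` change format, and the `rebuild_cluster` name are all assumptions made up for this example.

```python
import copy


def rebuild_cluster(pristine_snapshot, change_log):
    """Return a copy of the snapshot with the logged changes replayed in order.

    pristine_snapshot: dict of key -> value (hypothetical stand-in for old 49)
    change_log: iterable of (sequence, operation, key, value) tuples
                (hypothetical stand-in for the good changes made on 28)
    """
    # Work on a clone so the pristine copy is never modified.
    rebuilt = copy.deepcopy(pristine_snapshot)

    # Replay changes in the order they originally happened.
    for _seq, op, key, value in sorted(change_log, key=lambda c: c[0]):
        if op == "put":
            rebuilt[key] = value
        elif op == "delete":
            rebuilt.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return rebuilt


# Toy data: what 49 held at the time of the move ...
old_49 = {"entry:1": "first post", "entry:2": "second post"}

# ... and what users did on 28 afterwards (deliberately out of order).
changes_on_28 = [
    (2, "put", "comment:1", "nice post"),
    (1, "put", "entry:3", "posted after the move"),
    (3, "delete", "entry:2", None),
]

new_28 = rebuild_cluster(old_49, changes_on_28)
```

The point of cloning first is that a mistake during replay costs nothing: the pristine 49 is still there, and the replay can simply be run again.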

Note: we've done these sorts of moves so many times. We weren't doing anything fancy. The corruption happened because a step was missed that isn't generally necessary, but was this time. We've updated our sanity-checklist to include that step. We're also working on automating it all, so the human factor can never mess things up.

Anyway, once this is fixed, we'll then have the tools and knowledge to fix similar issues in the future, if need be. (Hopefully not)

