Brad Fitzpatrick (bradfitz) wrote in lj_maintenance,

Suck

Hey all,

Sorry about the crap performance the past couple hours. In a nutshell, I started doing something which spiralled out of control. I should've done it late tonight, but I mistakenly thought doing it now wouldn't affect anything.

It all went downhill when I started a table conversion (which I couldn't stop) on one machine. Instead of taking 8 or 9 minutes like the other tables, it has now been running for 3.73 hours and isn't anywhere close to done. That wasn't supposed to interfere with that machine's normal task, but it did because I overlooked that it'd block the replication thread, and our code stops using a machine once replication falls behind. (I thought it was safe because the table I was converting wasn't being served to clients, but I forgot about the replication thread blocking....)
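For anyone curious what "stops using a machine once replication falls behind" looks like, here's a rough sketch in Python. It is not the actual site code; the `Slave` type, the `usable_slaves` function, and the 30-second threshold are all made up for illustration, and the lag numbers are assumed to come from whatever monitoring reports how far each slave is behind the master.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold: how far behind a slave may be and still get reads.
MAX_LAG_SECONDS = 30.0


@dataclass
class Slave:
    name: str
    # Seconds behind the master; None means the lag is unknown
    # (for example, the replication thread isn't reporting).
    lag_seconds: Optional[float]


def usable_slaves(slaves: List[Slave],
                  max_lag: float = MAX_LAG_SECONDS) -> List[Slave]:
    """Return only the slaves that are caught up enough to serve reads."""
    return [
        s for s in slaves
        if s.lag_seconds is not None and s.lag_seconds <= max_lag
    ]


if __name__ == "__main__":
    pool = [
        Slave("db1", 0.0),
        Slave("db2", 13428.0),  # replication blocked behind a long conversion
        Slave("db3", None),     # lag unknown, so treated as unusable
    ]
    print([s.name for s in usable_slaves(pool)])
```

The catch is exactly what happened here: a long-running statement doesn't have to touch tables that clients read. If it blocks the replication thread, the lag climbs, the machine drops out of the pool, and its share of traffic lands on everybody else.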

So then that machine, an S2-hot slave, was out of operation, so its load spread to the other slaves and killed their caches.

So then I had to scramble to set up another machine to fill the busy machine's role so the other machines could get their caches hot again.

That took a while because the spare machine couldn't get a gigabit link to sync the global database. It had a gigabit network card, and it was on a gigabit switch, but they wouldn't negotiate, even after trying a dozen things on both the switch and host. So finally we waited for the database to sync over a slow 100Mbps connection.

Finally it finished and we gave it traffic, and things have been getting better since, as everybody's caches got happier.

But, ...... the bright sides:

1) one of the major things that sucked (the particular query, finding friend-ofs) during the suckage, while the caches were cold, has been changed to use memcache. That code isn't live yet, but will be soon enough, so it won't hurt so much in the future should this happen, and it'll be faster in general too.

2) we found a couple other things we could easily change, and some harder things that aren't incredibly hard, to speed things up.

3) the experiment that's running on the machine I accidentally killed should still prove to be interesting, once it finishes. (and it lets us know that we shouldn't do this on the global master later, since it won't actually take 7-8 minutes like the other tables did....)
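To illustrate item 1: moving a query like friend-ofs behind memcache usually means checking the cache first and only hitting the database on a miss. Here's a minimal sketch of that pattern in Python. It is not the actual code that's going live; `DictCache` is an in-process stand-in for a memcache client, and `friend_ofs`, the `friendofs:` key prefix, and the `query_db` callback are hypothetical names.

```python
from typing import Callable, Dict, List, Optional


class DictCache:
    """In-process stand-in for a memcache client (get/set only)."""

    def __init__(self) -> None:
        self._data: Dict[str, List[int]] = {}

    def get(self, key: str) -> Optional[List[int]]:
        return self._data.get(key)

    def set(self, key: str, value: List[int]) -> None:
        self._data[key] = value


def friend_ofs(user_id: int,
               cache: DictCache,
               query_db: Callable[[int], List[int]]) -> List[int]:
    """Return the users who list user_id as a friend, cache first."""
    key = f"friendofs:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached              # hit: no database work at all
    result = query_db(user_id)     # miss: run the expensive query once
    cache.set(key, result)
    return result


if __name__ == "__main__":
    def slow_query(user_id: int) -> List[int]:
        print("hitting the database for", user_id)
        return [2, 3, 5]

    cache = DictCache()
    friend_ofs(1, cache, slow_query)  # goes to the database
    friend_ofs(1, cache, slow_query)  # served from the cache
```

The sketch leaves out expiry and invalidation when somebody adds or removes a friend, which any real version would need. The point is just that the answer no longer depends on a slave's disk cache being warm.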

Totally unrelated: in the future, we'll be moving towards using MySQL Cluster for all the performance-sensitive tables, so we're not at the mercy of terribly slow disks which are the constant bane of our performance. That experimentation is still ongoing, though, as MySQL Cluster isn't fully mature yet.

What doesn't kill you only makes you stronger, etc, etc....

Sorry, this wasn't fun for me either. :-/