dormando (dormando) wrote in lj_maintenance,

Paid, and then whole site issues.


The last few days have been rather annoying. I'll try to detail it to the best of my memory.

In English: Some small maintenance didn't go as planned. The paid servers were slow for a while. I took one of the free servers and added it to the paid servers. The paid service should be fast now.

In Spec: All paid DB traffic used to go to jesus instead of its own set of slaves, in order to make sure there were no paid problems with replication. However, recently, jesus seems to be slowing down from something. I am having a hard time locating the problem, and will be researching it for a while. It isn't a common problem, for sure, since all of the numbers seem low. I won't get into much detail on this, since it would take me too long to type out all of the details/symptoms and the things I've tried already.

Last night we decided to try shutting swap off on jesus to speed it up some. I had previously done this on every other DB slave while it was running, and it did not cause any downtime or noticeable slowdowns on the slave.

When I did it on jesus, however, it slowed the machine down so much it stopped responding for several minutes at a time. It crashed MySQL, took ten minutes to actually clear the swap, and once swap was clear, kswapd started soaking CPU. MySQL was only taking 1.2G of memory on the machine, and it has 12G of RAM total. There was really no reason for it to want to swap so badly.

Finally, I turned swap back on, and it swapped a grand total of 1.5 megs of data, then instantly became fast again. More time was lost while repairing the indexes on two tables. Once that was done, the site came back OK.
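
As a rough illustration of the kind of check one might run before turning swap off, here is a minimal Python sketch that reads `/proc/meminfo`-style text and reports how much swap is in use. The sample numbers are made up for illustration (12G of RAM, 1.5 megs swapped) and are not real output from jesus. The "fits in free RAM" rule is only a naive assumption, not a kernel guarantee, and a check like this would not by itself explain the behavior described above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB."""
    return info["SwapTotal"] - info["SwapFree"]


def safe_to_swapoff(info):
    """Naive heuristic (an assumption): swapped pages should fit in free RAM."""
    return swap_used_kb(info) <= info["MemFree"]


# Hypothetical sample, NOT real data from the machine in this post.
SAMPLE = """\
MemTotal:       12582912 kB
MemFree:          524288 kB
SwapTotal:       2097152 kB
SwapFree:        2095616 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("swap used (kB):", swap_used_kb(info))
    print("fits in free RAM:", safe_to_swapoff(info))
```

On a live Linux box the same functions could be fed the contents of `/proc/meminfo` instead of the sample string.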

During this time I took one of the DB servers out of the free pool and gave it the majority of the paid traffic, while still leaving some on jesus. Everything has been fine since.
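
To make the idea of giving one server "a majority" of the traffic concrete, here is a minimal Python sketch of weighted server selection. It is not how the site actually routes queries; the name `paid-db1` and the 3:1 weights are hypothetical stand-ins chosen only for illustration.

```python
import random

# Hypothetical weights: "paid-db1" stands in for the server moved out of
# the free pool; the 3:1 split is an assumption, not the real configuration.
PAID_READ_WEIGHTS = {"paid-db1": 3, "jesus": 1}


def pick_server(weights, rng=random):
    """Pick one server name at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [pick_server(PAID_READ_WEIGHTS, rng) for _ in range(1000)]
    for name in PAID_READ_WEIGHTS:
        print(name, picks.count(name))
```

With weights like these, roughly three out of four paid reads would land on the new server, and the rest would still go to jesus.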

Also during this time, brad and I were trying to bring back the maintenance site (it displays a small error message), which we would re-route people to during times like that. It was lost during a hard drive failure (we missed backing up a few config files; that's covered now). We were too busy bouncing between the problems to figure out why it wasn't working. But for a few minutes it displayed a nice error message with my e-mail address on it :) Thanks for all of the hate mail. It was touching.

On a minor note, one of the free webslaves wasn't sending e-mail notifications the other day (so one out of every ten e-mails wouldn't get sent). I fixed it and flushed all of the waiting e-mail. I'm pretty sure not a single one was actually lost, since it had about 450 megs of mail in the spool directories.
