Brad Fitzpatrick (bradfitz) wrote in lj_maintenance

What's up lately...

I just got back from a late-night run in the cold and it put a lot of things in perspective. As I was just telling Dormando, I have a few minutes to kill before I jump in the shower so I'll use them trying to explain to everybody what's been up this past week.

The previous lj_maintenance post had comments disabled. First of all, an explanation of why we do that sometimes: when we're in the middle of fixing problems, incoming emails are a great distraction, especially from people complaining about things we already know about. Or worse, asking questions, which we try hard to put off answering until the problem's fixed, but it's hard sometimes and we waste time replying when we could be fixing things.

Perhaps in an ideal environment we'd all be close together and Dormando and I could just yell at some PR dude what the problem was and he could type it up on lj_maintenance all business-like and friendly. Instead, Dormando and I are in the middle of discussing things and trying to fix the site when one of us mentions to the other, "Oh, we should probably go post in lj_maintenance, go do that quick." The result is a quick, unhappy, undescriptive post about the problem. We know it's undescriptive and sloppy, begging for more info, which is another reason to turn comments off.

We always intend to follow problems up with a further explanation in lj_maintenance, but usually by the time things are fixed we're so relaxed or tired (or both) that we just forget or fall asleep.

So, on to real issues...

Why has the site sucked on and off the past week? The good news is that it all stems from the same root problem: getting our new huge, expensive beast of a database server (jesus) tuned properly. Over the past week jesus has stopped accepting new connections at random times. With hundreds of parameters to tune at 3 or 4 different levels, it's quite a chore, even with good documentation.

The noticeable problem lately has been "BML pages not working," as people keep reporting. There's no problem with BML itself. All processes on the site retire and die after a few hundred requests to prevent any memory "leaks". (When perl allocates memory from the OS, it keeps it forever in its own buffer pool for future allocations... it never gives it back to the OS with free(). So even if perl frees the memory, if the process grew huge at some point, it stays huge forever.) The reason the BML pages were the ones stopping is that the BML processes recycle faster than the journal-viewing processes, since people hit BML pages more often. And every time a process starts up, it allocates one persistent connection to both the master (update) database and a slave (read-only) database. But the master database wasn't accepting new connections, since we'd hit some soft limit for a parameter we hadn't set correctly and hadn't yet identified. So BML pages would hang. We'd restart everything when we saw that, and we'd be good again for anywhere from 30 seconds to another day. The randomness pissed us off. The error logs and other debugging facilities indicated nothing. This problem lasted many days.
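
(For the curious, the setup is roughly like the sketch below. It's just an illustration, not our actual code -- the hostnames, credentials, and the 500-request cap are made up for the example.)

    use strict;
    use DBI;

    my ($user, $pass) = ("lj", "password");   # placeholders

    # Each web process connects once at startup and keeps the handles
    # for its whole life: one to the master (updates) and one to a
    # read-only slave.
    my $dbh_master = DBI->connect(
        "DBI:mysql:database=livejournal;host=master.example.com",
        $user, $pass);
    my $dbh_slave = DBI->connect(
        "DBI:mysql:database=livejournal;host=slave1.example.com",
        $user, $pass);

    # If the master has stopped accepting connections, that first
    # connect() fails or hangs, so a freshly recycled process never
    # gets as far as serving a page -- which looks like "BML is broken".

    my $requests_served = 0;
    my $MAX_REQUESTS    = 500;    # retire after a few hundred hits

    sub serve_request {
        # ... handle one page hit using $dbh_master / $dbh_slave ...
        $requests_served++;
        # Exiting is the only way to hand ballooned memory back to the
        # OS, since perl hangs onto everything it has ever allocated.
        exit 0 if $requests_served >= $MAX_REQUESTS;
    }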

We finally took a web server out of the free user web pool to limit the connections while we investigated the real problem. We found the problem and fixed it (we think) just the other day.

But then there was today's replication lag problem! That was actually something a bit different... when InterNAP had the power outage a week or so ago, a bunch of the slave db files were corrupted, since they're written asynchronously for speed at three different levels: the RAID card's write-back cache, the OS filesystem (async flag), and MySQL (delayed-key-writes-for-all-tables). That last MySQL option was the source of today's problem: it had been off since the power outage, and we hadn't turned it back on because we'd been so busy debugging the Jesus issues (which were also affecting Mackey, the paid user slave db server). The site was holding up, and the fewer things we changed at once, the easier it was to isolate the real problem. (One control variable at a time, or something... I think I heard that in school. :P)

Anyway, we knew that option was off, and Dormando had been bugging me for days to turn it back on so the slaves would run a ton quicker, but I kept telling him, "no, no, we're fine for now." Yeah, we were fine, until load hit its peak and the slaves couldn't keep up anymore. The other reason I didn't want to change them was that it'd involve taking them out of our db pool, which is a tedious and involved process at this point. Today I wasn't around for a few hours at the peak of the problem, so he just went ahead and did it. We also set the noatime mount flag on a few freshly rebuilt servers we'd forgotten it on.
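
(A quick sanity-check script along the lines of the sketch below would catch both of those. It's only an illustration: the host, credentials, and data-partition path are placeholders, and delay_key_write is the spelling newer MySQL uses for that option.)

    use strict;
    use DBI;

    # Warn if delayed key writes are off, or if the data partition is
    # missing the noatime mount flag.
    my $dbh = DBI->connect(
        "DBI:mysql:database=mysql;host=slave1.example.com",
        "lj", "password", { RaiseError => 1 });

    # "ALL" means delayed key writes for every table.
    my ($varname, $value) =
        $dbh->selectrow_array("SHOW VARIABLES LIKE 'delay_key_write'");
    warn "delayed key writes are NOT enabled for all tables\n"
        unless defined $value && uc($value) eq "ALL";

    # /var/mysql is a placeholder for wherever the db files live.
    open my $mounts, "<", "/proc/mounts" or die "can't read /proc/mounts: $!";
    while (<$mounts>) {
        my (undef, $mnt, undef, $opts) = split;
        next unless $mnt eq "/var/mysql";
        warn "$mnt is missing the noatime mount flag\n"
            unless $opts =~ /\bnoatime\b/;
    }
    close $mounts;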

So here are the things we're doing to address the situation:

* Jesus fixage -- well, it's already fixed, we think. Which is good, since this was the real problem.

* Server build documentation -- we're writing step-by-step build procedures for each type of server. The idea here is that in the future when we get new hardware, we won't forget important steps like optimal RAID block sizes or filesystem mount flags. You can see the start of this work in CVS already. Expect more to go in the next few days. The side benefit of this is that I don't have to build the servers anymore... we can find a friendly monkey to do it for us.

* InnoDB -- we've been avoiding this for too long. InnoDB rocks, we know. The default table handler for MySQL (MyISAM, which is what we're using) only does table-level locking, so a write blocks everything else touching that table -- effectively no concurrency for a write-heavy site. That's why we spread the reads out over tons of slave DBs. But the throughput of our replication setup drops over time as writes increase. We're switching everything over to InnoDB (which does row-level locking) soon-ish, but first we're rereading all the documentation for the 7th time and testing.

* DB selector daemon -- our load balancing for db slave connections currently consists of a list of servers, weight values, and our friend rand(). But randomness sucks, especially when a random server is dead. A while back I wrote a daemon to monitor all the slave databases' performance, load, and eventually replication position; I need to get back to it. Then swapping dbs in and out for maintenance is easy: we modify the weight file, the dbselector daemon re-stat()s it every 10 seconds and reloads it, and then we wait 20 seconds. Every 20 seconds all web slaves must re-check out the lease on their db handle. If the database is still around, the dbselector daemon says the web client can keep using it; otherwise it's assigned a new one. This has the added benefit that if a db server randomly dies in the middle of the night, the dbselector realizes it and the site fixes itself immediately. (db handles are also checked for validity at the beginning of each web request... so if a db dies, the next request that tries to use it fails and asks the dbselector for a new one.) Also, the dbselector daemon itself is load balanced internally on the BIG-IP for high availability: the web clients don't connect to one machine for the dbselector, they connect to a virtual IP on the load balancer which finds one of the 2 dbselectors. (We do the same for our internal DNS.) But the redundancy doesn't stop there! Our second BIG-IP arrived in Seattle via FedEx according to the package tracker. It should be delivered tomorrow.
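
Back to the selection bit for a second: here's roughly what the current rand()-plus-weights picking and the per-request validity check look like. Just a sketch -- the server names, weights, and credentials are made up, and the real daemon obviously does more than this.

    use strict;
    use DBI;

    # Hypothetical weight list: server => weight.  Today this lives in
    # config; the dbselector daemon will own the file and re-stat() it
    # every 10 seconds.
    my %slave_weight = (
        "slave1.example.com" => 3,
        "slave2.example.com" => 2,
        "slave3.example.com" => 1,
    );

    # Pick a slave at random, proportionally to its weight.  The flaw:
    # rand() is perfectly happy to hand you a dead server.
    sub pick_slave {
        my $total = 0;
        $total += $_ for values %slave_weight;
        my $r = rand($total);
        foreach my $host (sort keys %slave_weight) {
            $r -= $slave_weight{$host};
            return $host if $r < 0;
        }
        return (sort keys %slave_weight)[0];   # shouldn't get here
    }

    # At the start of each web request, make sure the cached handle is
    # still alive; if it isn't, throw it away and grab a new one.
    sub get_slave_dbh {
        my ($cached) = @_;
        return $cached if $cached && $cached->ping;
        my $host = pick_slave();
        return DBI->connect(
            "DBI:mysql:database=livejournal;host=$host",
            "lj", "password");
    }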

Once we have the new BIG-IP installed and the dbselector daemon in use, nearly any machine can die and things will keep running automatically without our intervention. It's already like that for all our web servers, and with the dbselector daemon, any slave db will be able to fail too. The only things that could take down the site then would be the master database and its drive array, which we're planning on buying a copy of as a hot spare. If that goes down, we'd have to manually switch some stuff over, but it'd take an hour, not the 2-3 weeks it takes for a new machine to be ordered and delivered.

In summary, we know what we're doing (at least I think we do), but lately we haven't been good about keeping people informed.

When I'm stressed and things are breaking, the last thing I want to do is type novels like this one about why things are breaking (especially when we haven't even figured out the root cause). Just trust that we'll clue you in as soon as we have time and something interesting to tell you. But we'll try to be better at giving up-to-the-minute news... I'll grab opiummmm or sherm and have them post or something.

Anyway, that's that. I'm sure I forgot something. Discussion here is okay, but again, a reminder not to post important questions here as they might be overlooked. If it's important, file a support request. Thanks.