The problem was one database cluster with twice as many active users as any other. And because of all the work we've done lately putting everything into memory, the load falls disproportionately on the master database in that cluster. I'll explain:
In the past, when code wanted information about a user, it'd pick a [weighted] random database in that user's db cluster: slave or master. Now, using memcached, the code first tries memory, and if it's not there, gets the definitive copy from the master (so there's no replication delay) and puts that in memcache. So now, slave dbs aren't so useful. They're just there for snapshot backup points and such.
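The read path described above can be sketched like this. (This is a minimal illustration, not LiveJournal's actual code, which is Perl; the class and handle names are made up.)

```python
# Read-through caching: try memcache first; on a miss, read the definitive
# copy from the master (never a slave, so replication delay can't bite us)
# and populate the cache for next time.

class ReadThroughCache:
    def __init__(self, cache, master):
        self.cache = cache    # dict-like stand-in for a memcache client
        self.master = master  # callable: user_id -> row from the master db

    def get_user(self, user_id):
        key = "user:%d" % user_id
        row = self.cache.get(key)
        if row is not None:
            return row                # cache hit: no database touched at all
        row = self.master(user_id)    # miss: hit the master, not a slave
        self.cache[key] = row         # store it so the next read is a hit
        return row
```

Note how every cache miss lands on the master, which is exactly why the slaves stopped carrying read traffic.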
The ribeye cluster has 60,000 active users. The next nearest one has 31,000. Why did we let it get this bad? Well, we were testing a new database config (10 independent databases on one machine) which allows more concurrency and thus more active users per cluster. Our old limit was lower than 60k. Now we've [painfully] learned how far this new system scales (on this type of I/O subsystem, at least). We can move users around, but doing them one at a time is really slow. Fortunately there's a better way: because the databases are split into 10 in this new config, we can easily move 1/10th of the traffic to a different db cluster in one go.
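A toy sketch of why the 10-way split makes rebalancing cheap: if each user maps to one of 10 sub-databases, moving 1/10th of the traffic is just repointing one sub-database at a different cluster. (Illustrative only; the real user-to-database assignment lives in LiveJournal's global directory, not a modulo hash, and the cluster names here are invented.)

```python
NUM_SHARDS = 10

def shard_for(user_id):
    # which of the 10 sub-databases holds this user (hypothetical scheme)
    return user_id % NUM_SHARDS

# shard -> cluster currently serving it; start with everything on "ribeye"
shard_map = {s: "ribeye" for s in range(NUM_SHARDS)}

def move_shard(shard, new_cluster):
    # one map update moves ~10% of the users, instead of one user at a time
    shard_map[shard] = new_cluster

def cluster_for(user_id):
    return shard_map[shard_for(user_id)]
```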
We're in the process of freeing up one legacy database cluster, moving each user to either 1) another active cluster (in the new config) or 2) the idle cluster, or 3) expunging their data if they're deleted and it's been over 30 days. When that finishes, we'll have a fresh database cluster to use, and we'll move 40% of the loaded cluster's users over to it.
However, since that won't happen for another couple days, we have to do something in the meantime so the loaded cluster can cope. Plans include:
1) using the slave dbs in the cluster more wisely. There are ways to ensure the data on the slaves is current. One simple example is comments: if they're on the slave, they're current, since they can never be updated. There are ways for the rest of the data too, but they're a bit more involved.
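The comments example can be sketched as a routing rule: immutable data can never be stale on a slave once it's replicated there, so those reads can fan out, while everything else stays on the master. (The kind names and handles are hypothetical.)

```python
import random

# Data that is write-once: if a slave has a copy at all, it's current,
# because the row can never be updated after it's written.
IMMUTABLE_KINDS = {"comment"}

def pick_db(kind, master, slaves):
    if kind in IMMUTABLE_KINDS and slaves:
        return random.choice(slaves)  # spread immutable reads over slaves
    return master                     # mutable data: only master is definitive
```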
2) do some database table optimizations, but that'd involve turning the loaded cluster's users read-only for maybe an hour each. luckily the database is split into 10 manageable chunks, so we can optimize the data in 10 stages and no given user is stuck in read-only mode for too long. this may happen tonight, but more than likely it'll happen in a couple days, or tomorrow night during the maintenance window.
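The staged plan in (2) amounts to a simple loop over the 10 sub-databases, taking each one read-only, optimizing it, and releasing it before touching the next. (A sketch only; `set_read_only` and `optimize_tables` are hypothetical helpers standing in for the real maintenance steps, e.g. MySQL's OPTIMIZE TABLE.)

```python
def optimize_in_stages(shards, set_read_only, optimize_tables):
    for shard in shards:               # one of the 10 sub-databases at a time
        set_read_only(shard, True)     # only this shard's users are affected
        try:
            optimize_tables(shard)     # rebuild/defragment this shard's tables
        finally:
            set_read_only(shard, False)  # always release, even on failure
```

The point of the loop is the blast radius: at any moment only ~1/10th of the cluster's users are read-only.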
In any case, we're aware of the problem and know how to fix it... we're just waiting on things to move around. Sorry. :-/
Update: It looks like the problem might be Semagic's new (and broken) syncitems support, pounding the site wrong. The syncitems mode bypasses memcache. That client was released right at the same time (Sunday night) that the site started to suck again. That's too much of a coincidence. We're either going to ban syncitems for broken clients, or direct all that traffic to the idle slaves. This might be easier to fix than I thought.... I'm glad there's finally a possible reason. The site and dbs were totally idle for a couple days there... I didn't think Sunday night's normal load would've done what it did. Anyway, back to investigating...
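The two fixes floated above could look something like this: either refuse syncitems requests from the broken client release, or absorb them on the otherwise-idle slaves instead of the master. (Everything here is hypothetical, including the version string; it just shows the shape of the decision.)

```python
BROKEN_CLIENTS = {"Semagic-broken-build"}  # placeholder for the bad release

def route_syncitems(client_id, master, idle_slaves, ban=False):
    if client_id in BROKEN_CLIENTS:
        if ban:
            return None       # option 1: ban syncitems for broken clients
        return idle_slaves    # option 2: let the idle slaves take the pounding
    return master             # well-behaved clients are unaffected
```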