The cutover to the new load balancers went relatively well. We've definitely had worse nights... this was no sweat. Props to Matthew.
Junior and I are verifying that the config is all correct, but as far as we can tell things are flying and all the page types seem to load.
We'll continue to monitor things for the next hour and Matthew is still on-site testing as well, if we need him.
We're seeing packet loss on the LiveJournal internal network due to last night's load balancer upgrade. The new load balancers were plugged into a different (wrong) spot than the old ones, so they're now stressing a weak switch of ours and maxing out a link/backplane. So -- we're all waking up and going to move things around again... it should just involve unplugging them and plugging them in somewhere else.
Will update again when things are happy.
So the way our new load balancers were installed, there was gigabit (1000 Mbps) everywhere, except one path that was overlooked and was only 100 Mbps... so effectively the entire site's outgoing traffic was capped at 100 Mbps.
Matthew fixed the problem and we're now back to pushing 250-350 Mbps.
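For the curious, here's a minimal sketch of how you can hunt for a stray 100 Mbps link on a box (assuming Linux, reading the kernel's reported speed from sysfs -- purely illustrative, not what Matthew actually ran):

```python
# Hypothetical sketch: flag any interface negotiating below gigabit,
# using the link speed Linux reports under /sys/class/net/<iface>/speed.
import glob

def link_speeds():
    speeds = {}
    for path in glob.glob("/sys/class/net/*/speed"):
        iface = path.split("/")[4]
        try:
            with open(path) as f:
                speeds[iface] = int(f.read().strip())  # speed in Mbps
        except (OSError, ValueError):
            pass  # link down or virtual interface; no speed to report
    return speeds

for iface, mbps in sorted(link_speeds().items()):
    flag = "  <-- bottleneck?" if mbps < 1000 else ""
    print(f"{iface}: {mbps} Mbps{flag}")
```

Of course, that only covers one host; the overlooked path here was between devices, which is why it took eyes on-site to find.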
So I think we're good now.
We'll be around all day watching it.... the sleep gods have it in for us.
Posting by phone works again.
Our apologies for the downtime.... load balancing UDP is weird.
So apparently there's this rumor going around that's making people run for the hills and archive their accounts, effectively killing the one remaining database cluster (Chef) that hasn't been converted to InnoDB yet.
So there are two fixes that make the site work:
1) make free users on Chef read-only, to eliminate the read/write lock contention in MyISAM tables (MyISAM only does table-level locks, so a write blocks all reads on that table, and a long read blocks writes).
2) block the backup clients for people on the Chef cluster, and resume read/write access to all users.
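The "lock contention" behind fix #1 boils down to MyISAM taking one lock per table, while InnoDB locks individual rows. Here's a toy Python sketch of the difference (a threading analogy, not actual MySQL behavior): a long archiver scan stalls a quick write under a table lock, but not under row locks.

```python
import threading
import time

# Analogy: MyISAM = one lock for the whole table; InnoDB = a lock per row.
table_lock = threading.Lock()
row_locks = [threading.Lock() for _ in range(4)]

def archiver_scan(lock, seconds):
    # A bulk archive read holds its lock for the whole scan.
    with lock:
        time.sleep(seconds)

def timed_write(lock):
    # A quick single-row update; returns how long it waited for the lock.
    start = time.time()
    with lock:
        pass
    return time.time() - start

# Table lock: the write must wait out the entire scan.
t = threading.Thread(target=archiver_scan, args=(table_lock, 0.5))
t.start(); time.sleep(0.1)
table_wait = timed_write(table_lock)   # blocked until the scan finishes
t.join()

# Row locks: the scan holds row 0, the write touches row 1 -- no waiting.
t = threading.Thread(target=archiver_scan, args=(row_locks[0], 0.5))
t.start(); time.sleep(0.1)
row_wait = timed_write(row_locks[1])   # proceeds immediately
t.join()

print(f"table-lock write waited {table_wait:.2f}s, row-lock write waited {row_wait:.2f}s")
```

With a cluster full of archivers doing long scans, that first case is exactly what free users on Chef were hitting.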
For now I've gone with #1, since I have to run to dinner because I'm starving and I need the site to work.
Perhaps later we'll do #2 for a couple hours while we rush a migration of all those users on Chef to InnoDB database clusters.
But this is just a warning that later we'll be shutting down archiver access for that one cluster (about 7% of users?), and only until we finish migrating people to better hardware.
But for now, sorry free users on Chef.... for the good of the many? :-/
It was the best I could do for now....