January 5th, 2005

And we're back....

The cutover to the new load balancers went relatively well. We've definitely had worse nights... this was no sweat. Props to Matthew.

Junior and I are verifying the config is all correct, but from what we can tell things are flying and all the page types seem to load.

We'll continue to monitor things for the next hour and Matthew is still on-site testing as well, if we need him.

load balancer and network

We're having packet loss on the LiveJournal internal network due to last night's load balancer upgrade. Because the new load balancers were plugged into a different/wrong spot from the old ones, they are now stressing a weak switch of ours and maxing out a link/backplane. So -- we're all waking up and going to move things around again... it should just involve unplugging it and plugging it in somewhere else.

Will update again when things are happy.

And we're fast again.

So the way our new load balancers were installed, there was gigabit (1000 Mbps) everywhere, except one overlooked path that was only 100 Mbps... so effectively the entire site was capped at 100 Mbps of outgoing traffic.

Matthew fixed the problem and we're now back to pushing 250-350 Mbps.
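For the terminally curious, this is the kind of thing you can sanity-check with a dumb script. Here's a minimal sketch (not what we actually run; it assumes a Linux box with sysfs, where /sys/class/net/&lt;iface&gt;/speed reports the negotiated Mbps) that flags any interface that came up below gigabit:

    # Hypothetical sketch: flag interfaces that negotiated below gigabit.
    import os

    EXPECTED_MBPS = 1000  # everything here is supposed to be gigabit

    for iface in sorted(os.listdir("/sys/class/net")):
        path = os.path.join("/sys/class/net", iface, "speed")
        try:
            speed = int(open(path).read().strip())
        except (IOError, ValueError):
            continue  # down/virtual interfaces don't report a speed
        if 0 < speed < EXPECTED_MBPS:
            print("%s negotiated only %d Mbps -- possible bottleneck" % (iface, speed))

Whether the slow hop is a mis-negotiated port or a plain old 100 Mbps link, the lesson is the same: one slow hop caps everything behind it.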

So I think we're good now.

We'll be around all day watching it.... the sleep gods have it in for us.

phone posts

Posting by phone works again.

Our apologies for the downtime.... load balancing UDP is weird.

more work

Paid users' requests are at fast sub-second speeds, but free users' requests just fell behind to around 3-4 seconds because our one old database cluster that hasn't been upgraded over the past month is now croaking under load.

SO .....

We're gonna work on it now, instead of at night.

There will be two steps:

-- Step 1: the users on the Tbone cluster (Where am I?) will go read-only for a minute or so while we convert it from one master + 2 slaves to 2 co-masters. (So basically we're making the two currently unused children be each other's masters... and the active parent will be freed up.) The 2 new co-master machines are beefier than the old parent. All good. (A rough sketch of what this looks like follows this list.)

-- Step 2: we'll take that old hardware that was running Tbone and make it help out the Chef cluster, which is the last unconverted cluster. (Everybody else is InnoDB now, except Chef, which is still MyISAM and straining.) That step will be easier than step 1, but both are relatively easy.
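For the database nerds, here's roughly what step 1 looks like at the MySQL level. This is just a sketch, not our actual scripts: the hostnames, credentials, and binlog coordinates are all made up, and in real life you'd quiesce writes and read the positions off SHOW MASTER STATUS on each box before pointing them at each other.

    # Rough sketch: turning two idle slaves into each other's masters.
    # Hostnames, credentials, and binlog coordinates are hypothetical.
    import MySQLdb

    def point_at(slave_host, master_host, log_file, log_pos):
        """Re-point one box's replication at its new co-master."""
        db = MySQLdb.connect(host=slave_host, user="repl_admin", passwd="secret")
        cur = db.cursor()
        cur.execute("STOP SLAVE")
        cur.execute(
            "CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER=%s, MASTER_PASSWORD=%s, "
            "MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
            (master_host, "repl", "repl_secret", log_file, log_pos))
        cur.execute("START SLAVE")
        db.close()

    # Each former slave becomes the other's master; writes can then go to either.
    point_at("tbone-a", "tbone-b", "tbone-b-bin.001", 4)
    point_at("tbone-b", "tbone-a", "tbone-a-bin.001", 4)

Once both are replicating off each other, the old parent has no children left and can be pulled out for step 2.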

Just letting you guys know why your accounts might be read-only for a bit... it'll get rid of the 3-4 second delay for free users.

Then over the next week we'll be slowly moving all Chef users to other clusters that are already fast and InnoDB, then Chef hardware will be upgraded and repurposed.
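In case "moving users to other clusters" sounds mysterious: each user's journal data lives on one cluster, and a global table maps the user to that cluster, so a move is basically copy-then-flip-the-pointer. Purely illustrative sketch below -- the table and column names are made up, not our real schema:

    # Illustrative only: move one user's rows from a source cluster to a
    # destination cluster, then flip the global user->cluster pointer.
    # "usermap", "clusterid", and the table names are hypothetical.
    import MySQLdb

    def move_user(userid, dst_clusterid, global_db, src_db, dst_db):
        gcur = global_db.cursor()

        # 1. Freeze the user so nothing writes to the old cluster mid-copy.
        gcur.execute("UPDATE usermap SET readonly_flag=1 WHERE userid=%s", (userid,))
        global_db.commit()

        # 2. Copy the user's rows to the destination cluster.
        scur, dcur = src_db.cursor(), dst_db.cursor()
        for table in ("entries", "comments", "userprops"):   # hypothetical tables
            scur.execute("SELECT * FROM %s WHERE userid=%%s" % table, (userid,))
            for row in scur.fetchall():
                marks = ",".join(["%s"] * len(row))
                dcur.execute("INSERT INTO %s VALUES (%s)" % (table, marks), row)
        dst_db.commit()

        # 3. Flip the pointer and unfreeze; reads/writes now hit the new cluster.
        gcur.execute("UPDATE usermap SET clusterid=%s, readonly_flag=0 WHERE userid=%s",
                     (dst_clusterid, userid))
        global_db.commit()

The nice part of doing it per-user is that each user only goes read-only for the few seconds their own rows are being copied.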

Got all that? :-)

Update

Step 1 of this post is done. Tbone users are done with their read-only window.

Step 2 will happen in 30-40 minutes. We're doing a backup of the old Tbone data pre-deparenting, just to be paranoid. Until that finishes, the site will remain slow for free users. Once the backup's done we'll be able to reuse the old Tbone master as another slave to help out Chef users, and the whole site will get faster.
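(If you're wondering what the backup looks like: nothing fancy, just a full dump of the old master before we detach and repurpose it. Something along these lines, though the host name and path here are placeholders, not literally what we run:)

    # Illustrative: full dump of the old Tbone master before repurposing it.
    # Host/user/path are placeholders.
    import subprocess

    subprocess.call(
        "mysqldump -h tbone-old-master -u backup --all-databases "
        "| gzip > /backups/tbone-pre-deparent.sql.gz",
        shell=True)

After that it gets loaded with a copy of Chef's data and pointed at the Chef master with the same CHANGE MASTER TO dance as in the earlier sketch.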

If you look at the bright side, only 9.2% of our active users are still on MyISAM (which can't really cope with concurrency at all, at least not without dirty hacks).... Once we're done converting Chef, hopefully we can ignore databases for a bit. So far the InnoDB machines we've already converted have been great.
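(For reference, "converting" a cluster mostly boils down to rebuilding every table with a different storage engine. The snippet below is the blunt in-place version just to show the idea -- the connection details are placeholders, and in practice you'd do this on a spare copy of the data, not a box serving traffic:)

    # Illustrative: convert every MyISAM table on one server to InnoDB.
    # Host/user/database names are placeholders.
    import MySQLdb

    db = MySQLdb.connect(host="chef-spare", user="admin", passwd="secret", db="livejournal")
    cur = db.cursor()
    cur.execute("SHOW TABLE STATUS")
    for row in cur.fetchall():
        name, engine = row[0], row[1]
        if engine == "MyISAM":
            print("converting %s ..." % name)
            cur.execute("ALTER TABLE %s ENGINE=InnoDB" % name)  # TYPE=InnoDB on older 4.0 servers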

Updates as they come...

archive paranoia

So apparently there's this rumor going around that's making people run for the hills and archive their accounts, effectively killing the one remaining database cluster (Chef) that's not InnoDB yet.

So there are two fixes that make the site work:

1) make free users on Chef read-only, to eliminate read/write lock contention on the non-concurrent MyISAM tables.

2) block the backup clients for people on the Chef cluster, and resume read/write access for all users.
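(For the record, #2 would look something like the sketch below at the web layer -- refuse requests that look like backup clients when the user's journal lives on Chef. The function name, cluster id, and User-Agent substrings are all examples, not a description of our actual code.)

    # Illustrative sketch of option #2: turn away archiver/backup clients
    # for users whose journals live on the Chef cluster.
    # get_cluster_for(), CHEF_CLUSTER_ID, and the User-Agent substrings
    # are hypothetical examples.
    ARCHIVER_AGENTS = ("ljarchive", "ljbackup", "journal-export")
    CHEF_CLUSTER_ID = 3

    def allow_request(userid, user_agent, get_cluster_for):
        """Return False if this looks like a backup client hitting Chef."""
        ua = (user_agent or "").lower()
        if any(token in ua for token in ARCHIVER_AGENTS):
            if get_cluster_for(userid) == CHEF_CLUSTER_ID:
                return False  # hand back a 503 and a "please wait a week" note
        return True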

For now I've gone with #1, since I have to run to dinner because I'm starving and I need the site to work.

Perhaps later we'll do #2 for a couple hours while we rush a migration of all those users on Chef to InnoDB database clusters.

But this is just a warning that we'll only be shutting down the archiver access later for that one cluster (like 7% of users?) and just until we finish migrating people to better hardware.

But for now, sorry free users on Chef.... for the good of the many? :-/

It was the best I could do for now....