March 16th, 2004

status

This week's LJ performance has been pretty good, with the exception of yesterday evening, when our load balancing went all to hell. We found the bug, though, so we're happy/relieved. (It'd been randomly breaking/slowing the site for weeks.)

In general the site sucks for 1 of 3 reasons:

1) disks aren't fast enough / we don't have enough disks (DB servers)
2) not enough CPU (web servers)
3) load balancing (the bug we fixed)

We fixed #3 last night, and #2 by buying a bunch more web nodes the other week, so things are looking good. We also just got a new pair of DB servers, so while #1 isn't a problem yet, we're ready with new hardware.

Of course, Murphy will come and kick my ass now that I've posted that things are good, but whaddya gonna do?

santa - read only

We just lost a drive in Santa's disk array. It's rebuilding from a hot spare, but that database is a little slow during the rebuild. (This is one of the DB clusters that's not yet in master-master, so we can't fail over to the other pair. We're still working on upgrading all the old ones.)

Anyway, during the slow period until it rebuilds (30 min - 1 hour or so), most users on the Santa cluster are in read-only mode. [Where am I?]

BTW, losing disks like this isn't a big deal. The best disks are rated (warrantied) for 3-5 years. Let's assume 4 years. If we have 100 disks (I'm just guessing), that's 25 failures a year, or about one every 2 weeks if they were spread out evenly. But a lot of these disks we've had for years now. Fun fun. That's why you should never trust a single disk.... always use RAID, boys and girls! :-)
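
For anyone who wants to check the math, here's a quick back-of-the-envelope sketch in Python. It uses the same guessed numbers as above (100 disks, a 4-year lifetime each) and assumes failures are spread out evenly, which real disks certainly don't do. The function name and constants are made up for illustration.

```python
# Back-of-the-envelope disk failure rate, using the guessed numbers from the post.
DISKS = 100            # guessed number of disks, not a real inventory count
LIFETIME_YEARS = 4     # assumed average lifetime per disk
WEEKS_PER_YEAR = 52


def weeks_between_failures(disks, lifetime_years):
    """Average weeks between failures, assuming failures are spread out evenly."""
    failures_per_year = disks / lifetime_years
    return WEEKS_PER_YEAR / failures_per_year


if __name__ == "__main__":
    weeks = weeks_between_failures(DISKS, LIFETIME_YEARS)
    print(f"Roughly one disk failure every {weeks:.1f} weeks")
```

Doubling the disk count halves the gap between failures, which is the whole reason RAID and hot spares matter more as you grow.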