We've been getting a lot of email from people lately (especially paid users) complaining about how poor the service has been. We're truly sorry. It is getting better, though; it's just difficult to explain to people without getting too technical. If you don't feel like paying anymore, don't. Keep using the site as a free user, and when you feel like we're reliably fast enough, consider paying again. We don't want your money unless you're a happy customer. Make us earn your happiness and your money.
We're doing a lot of cool stuff to get the site consistently fast and reliable.
-- today was pretty fast, as enough Ribeye traffic is now moved over to Chef. (Ribeye was overloaded and we noticed way too late)
-- we're doing work on userpics late tonight, so they may be slow to load for a while. it may also impact general site speed, but not for long.
We have seven database clusters: five for active users, one for syndication accounts (RSS), and one for inactive users. The RSS one is a single machine (not a cluster) and is isolated just because we were paranoid when we rolled that feature out. Isolation never hurts. The inactive cluster has a big IDE RAID array. It's fast, but not as fast as SCSI. The other five clusters are pretty much identical: 6 really fast SCSI disks... 2 system/log disks in RAID 1, 3 data disks in RAID 5, and a hot spare.
Each user is on a cluster.
When disk space runs low on them or I/O gets too great, we move people around, move people to inactive, etc. We try and keep things balanced between all the clusters. Unfortunately, we weren't monitoring active users per cluster (we had manual tools to check it, but nothing automated), and we let Ribeye take all the new users for too long a period. New users are generally active users, so Ribeye suffered. When any cluster suffers, the whole site suffers, regardless of which cluster you're on. There can only be so many active/pending requests at any time. Normally that hovers around 50-200. When Ribeye was sucking, it was around 3500.
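To illustrate the fix for the "all new users on one cluster" mistake: instead of pointing signups at a single cluster, you assign each new user to whichever cluster currently has the fewest active users. A minimal Python sketch of that idea (our real code isn't Python, and the cluster names and load numbers here are made up):

```python
# Pick a home cluster for a new user by choosing the one with the
# fewest active users, rather than hardcoding a single target cluster.
# Cluster names and counts below are purely illustrative.

def least_loaded(active_counts: dict) -> str:
    """Return the cluster name with the smallest active-user count."""
    return min(active_counts, key=active_counts.get)

# hypothetical snapshot of active users per cluster
active_counts = {"chef": 1200, "ribeye": 3100, "porterhouse": 1500}

print(least_loaded(active_counts))  # chef
```

Even this trivial policy avoids the failure mode above: as one cluster fills up with active new users, it stops being the least-loaded one and signups drift elsewhere.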
Today was good because enough users have finished moving back to Chef that Ribeye was able to take the load.
So there are basically two problems with clusters getting loaded: disk space and disk seeks (the drive head moving around making noise).
For the space problem (which is really the least of our worries), we're now independently compressing all posts/comments in the database. This makes things about 50% smaller. A side advantage is that the disk has to move around a bit less to get everything, and the computer's caches are better used.
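The compression idea is simple: each post/comment body gets compressed on its own, so any single row can be read back and decompressed without touching its neighbors. A hedged Python sketch (LJ's actual code is different; the function names here are invented):

```python
# Independently compress each post/comment body. Compressing rows
# one at a time keeps random access cheap: decompressing one entry
# never requires reading any other entry.
import zlib

def compress_body(text: str) -> bytes:
    return zlib.compress(text.encode("utf-8"))

def decompress_body(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")

body = "I updated my journal again today... " * 50
blob = compress_body(body)

print(len(body.encode("utf-8")), "->", len(blob))
assert decompress_body(blob) == body
```

Real journal text won't compress as well as this repetitive example, but roughly halving row sizes also means fewer disk reads per entry and better use of the machine's caches, which is the side benefit mentioned above.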
For the seeking problem, we're looking at getting some solid-state disks and moving the most popular database index files there. Solid-state disks are hundreds of times faster than spinning disks for random access, and there's no penalty for random reads. Because access to our index files is both seek-bound and write/sync-heavy, SSDs definitely seem the way to go.
What we're doing to make sure this doesn't happen again:
-- more monitoring. we now have a much better means of quickly knowing active users per cluster, which means we can graph it over time.
-- better distribution of new users.
-- buying another cluster
-- index files on super fast devices
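The monitoring item above boils down to one query we now run automatically: count active users grouped by cluster. A small Python sketch of the computation (our real tooling and user-table layout differ; the rows below are invented):

```python
# Count active users per cluster from (userid, cluster, active) rows,
# the number we failed to watch automatically before. Graphing this
# over time makes a lopsided cluster obvious long before it melts.
from collections import Counter

# hypothetical rows from a users table
users = [
    (1, "chef", True),
    (2, "ribeye", True),
    (3, "ribeye", True),
    (4, "ribeye", False),   # inactive users don't count
    (5, "porterhouse", True),
]

def active_per_cluster(rows):
    counts = Counter()
    for _userid, cluster, active in rows:
        if active:
            counts[cluster] += 1
    return counts

print(active_per_cluster(users))
```

Once the count is automated, feeding it to a grapher is the easy part; the point is never again discovering a cluster is overloaded only after the request queue hits 3500.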
As for userpics, we're getting set up with Akamai late tonight. While Akamai's caches fill, they'll be hitting us pretty hard, possibly slowing down the rest of the site. It shouldn't last long, though.
I'm thinking we should have a community for discussing LJ's technical backend, so we don't bore the majority of people in lj_maintenance. From now on I'll try and post descriptive updates to lj_backend, and just link them from lj_maintenance.