February 10th, 2002

Status

Summary:
Moving users to clusters.
Things getting good.
Misc hardware failures annoying, but non-issue.

Long Version:
Clustering move is going well. Slow, but well. The performance gain is already visible in many areas. The active users should be fully moved in under 3 weeks.

Also good --- we have a new internal database management tool that makes tasks that were once incredibly tedious & painful a breeze. This lets us very quickly react to problems and do maintenance. It also leads the way to a ton of smart automation ("This database slave is over 1MB behind, let's take it out of rotation until it catches up, and when it does, lower its load by 5%.")
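As a rough illustration of the kind of automation rule quoted above, here is a minimal sketch. The `Slave` class, its `weight` field, and the `apply_lag_rule` function are all hypothetical names made up for this example, not part of the actual tool. It also assumes "1MB" means 1,000,000 bytes, that "caught up" means zero lag, and that "lower its load by 5%" means a 5% relative cut.

```python
LAG_LIMIT_BYTES = 1_000_000  # "over 1MB behind" (assumed to mean 10**6 bytes)
WEIGHT_CUT = 0.05            # "lower its load by 5%" (assumed relative cut)


class Slave:
    """Hypothetical record for one read slave in the rotation."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight      # hypothetical share of read traffic
        self.in_rotation = True


def apply_lag_rule(slave, lag_bytes):
    """Apply the quoted rule once, given the slave's current replication lag."""
    if slave.in_rotation and lag_bytes > LAG_LIMIT_BYTES:
        # Too far behind: stop sending it reads until it catches up.
        slave.in_rotation = False
    elif not slave.in_rotation and lag_bytes == 0:
        # Caught up: put it back in rotation with 5% less load than before.
        slave.weight = round(slave.weight * (1 - WEIGHT_CUT), 6)
        slave.in_rotation = True
    return slave
```

Calling `apply_lag_rule` on each slave every few seconds would be enough to drive this; how the lag is actually measured is left out on purpose.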

Friday was a perfect day -- no outages or slowness. Haven't had one of those since the Friday prior to that. Saturday was pretty good, but had some slowage mid-day and a few db issues.

The future looks good, though ... as more users get clustered it'll just keep getting better.

Most annoying right now is that we're dealing with faulty hardware left and right ... had one ethernet port on a web server go half bad (just switched it to the other port), had a processor die on another web server (but the other processor kept working), had a whole slave db server die (it always sucked... it's been nothing but trouble from day 1), and now we had a disk die on another slave db server (but the hot spare kicked in and the array rebuilt within a few minutes).

Still ... all these issues mean we get stressed out and have to call vendors and get replacements. Most troubling is the dead slave db server .... that's some extra CPU that'd be nice to have, considering we pulled two slaves out to be cluster slaves. But never fear --- we're just throwing our backup machine into rotation. We normally keep it out so we don't get used to having it, and only throw it in when needed. We're also putting Jesus into slave rotation: it's doing so much less lately that it can take on more reads.

Request: Try to stay on topic. I get emailed all your replies, and I try to reply to people that have good questions, but I can't find the good questions if there are hundreds of jokes about naming a server Jesus. :-)

Clustering Q&A

Some good questions have come up from my last post, so I'll answer them here so everybody sees them...

roshi asks: "Is moving journals to the new cluster/the repairs etc the reason why I am getting 'This journal is now in read-only mode' messages in my client? Or have I done something stoopid?"

Answer: Yes, the move is the reason. While your journal is being moved to a cluster, you can't post anything new and people can't leave comments in your journal. However, you can post in other people's journals, and people can still read your journal & comments. A conversion takes between 3 minutes and 20 minutes, though we've even seen some take up to 45 minutes... it depends on how many posts and comments there are in your journal. We thought this was a better trade-off than taking the whole site down for a month while we converted everything at once. :-)
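The read-only behavior described in that answer boils down to one check per request. This is only a sketch of the rule as stated; the function name `is_allowed`, the action strings, and the `journals_being_moved` set are hypothetical and not taken from the site's code.

```python
def is_allowed(action, target_journal, journals_being_moved):
    """Return True if `action` on `target_journal` is allowed right now.

    `action` is one of "read", "post", or "comment" (hypothetical names).
    """
    if action == "read":
        # Entries and comments stay readable during a move.
        return True
    # Posting and commenting are writes: blocked only while the
    # journal being written to is in the middle of its move.
    return target_journal not in journals_being_moved
```

Note that the check looks at the journal being written to, not at who is writing, which is why a user whose own journal is mid-move can still post in other journals.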

darkmoon asks: "Just out of curiosity, how is the order of moving the users being determined? I appear to be moved already while most of the people on my friends page haven't yet. If this has already been discussed, a link will do. :)"

Answer: We're converting random users that have posted in the last 2 days. At night we convert around 14 people at once. During the day, around 5. Currently there are 68,386 people that have posted in the last 2 days still unconverted. 3,580 are on cluster1, and 2,135 are on cluster2. All new users signing up are going to cluster2. We have the hardware for cluster3, but we won't make it into a cluster yet. After we convert the 68,000 recent posters, there are about 400,000 less-active accounts to convert... those should go a lot quicker, since 1) they'll have fewer posts, and 2) we can run the converter a ton faster, since the master database will already be idle after having moved all the active users away.
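For a back-of-the-envelope feel for those numbers, the sketch below estimates how many days the 68,386 recent posters would take. Several things here are assumptions rather than facts from this post: that "at once" means that many conversions running in parallel, that night and day are 12 hours each, that a new conversion starts the moment one finishes, and the average conversion times tried. The function name `days_to_convert` is made up for the example.

```python
def days_to_convert(remaining, night_parallel, day_parallel,
                    avg_minutes, night_hours=12.0):
    """Estimate the days needed to convert `remaining` journals.

    Assumes every parallel slot starts a new conversion as soon as
    the previous one finishes (no idle time between conversions).
    """
    day_hours = 24.0 - night_hours
    # Total minutes of converter time available per day across all slots.
    slot_minutes_per_day = 60.0 * (night_parallel * night_hours
                                   + day_parallel * day_hours)
    conversions_per_day = slot_minutes_per_day / avg_minutes
    return remaining / conversions_per_day


# Try a few assumed average conversion times, in minutes.
for avg in (3, 10, 20):
    print(avg, round(days_to_convert(68_386, 14, 5, avg), 1))
# prints:
# 3 15.0
# 10 50.0
# 20 100.0
```

Under these assumptions the total is very sensitive to the average conversion time, so the estimate is only as good as that guess.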

belgiandom asks: "With this new cluster setup, is there still a speed advantage for paid accounts?

And, if I turn my account into a paid one at this point, would it actually get moved to the right server, or might whatever script is doing the moving screw things up?"


Short Answer: Yes, there will still be a speed advantage.

Answer: There are no "paid clusters" or "free clusters". The philosophy is that all database servers are going to be insanely fast. So then the "problem" is that the site gets a lot faster for free users. Well, is this a problem? Kinda. The big issue is that we pay by bandwidth ... if the site is faster, people use the site faster, and as a result, it costs us more to support free users. So what we'll probably end up doing is throttling the bandwidth for free users, or adding an artificial delay to each free user request. The delay won't make the site feel slower than what it was historically ... it'll still seem faster. But it won't blink onto the screen. Paid users will still have their own set of web servers that are a lot less loaded, and not artificially delayed at all. I hope people don't complain about this delay ... we're not capitalistic bastards. We're trying to keep everybody happy. The goal is to make the free site feel fast, and the paid site feel very fast.
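To make the "artificial delay" option concrete, here is a minimal sketch of one way it could work. Nothing here is decided: the 0.25 second figure is a placeholder picked for the example, and `delay_for` and `handle_request` are hypothetical names.

```python
import time

FREE_DELAY_SECONDS = 0.25  # placeholder value; no real number has been chosen


def delay_for(account_type):
    """Return the artificial delay, in seconds, for this account type."""
    return FREE_DELAY_SECONDS if account_type == "free" else 0.0


def handle_request(account_type, render, sleep=time.sleep):
    """Wait out any artificial delay, then build and return the page.

    `render` is any zero-argument callable that produces the page.
    `sleep` can be swapped out so the delay is testable without waiting.
    """
    delay = delay_for(account_type)
    if delay > 0:
        sleep(delay)
    return render()
```

Paid requests skip the sleep entirely, so the only difference between the two paths is the wait before the page is built.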

brant says: "You guys rock. Can I send more money? =D"

Answer: But of course! :-) Buy a friend a paid account! Extend the expiration date of your own paid account! We promise to do interesting & useful things with all money we get. So far none of us own mansions, private jets, or Porsches. :P