Brad Fitzpatrick (bradfitz) wrote in lj_maintenance,

Status

Summary:
Moving users to clusters.
Things getting good.
Misc hardware failures annoying, but non-issue.

Long Version:
Clustering move is going well. Slow, but well. The performance gain is already visible in many areas. The active users should be fully moved in under 3 weeks.

Also good --- we have a new internal database management tool that makes tasks that were once incredibly tedious & painful a breeze. This lets us very quickly react to problems and do maintenance. It also leads the way to a ton of smart automation ("This database slave is over 1MB behind, let's take it out of rotation until it catches up, and when it does, lower its load by 5%.")
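For the curious, the quoted rule could be sketched roughly like this. It is only an illustration, not the actual tool: the `Slave` class, the `weight` field, and `apply_lag_rule` are hypothetical names, 1MB is assumed to mean 10**6 bytes, "caught up" is assumed to mean zero bytes behind, and "lower its load by 5%" is read as a 5% relative cut.

```python
from dataclasses import dataclass

LAG_LIMIT_BYTES = 1_000_000  # "over 1MB behind"; 10**6 bytes is an assumption
WEIGHT_CUT = 0.05            # "lower its load by 5%", read as a relative cut


@dataclass
class Slave:
    """Hypothetical record for one database slave."""
    name: str
    weight: float            # assumed relative share of read traffic
    in_rotation: bool = True


def apply_lag_rule(slave: Slave, bytes_behind: int) -> Slave:
    """Apply the quoted rule once, given the latest replication lag reading."""
    if slave.in_rotation and bytes_behind > LAG_LIMIT_BYTES:
        # Too far behind: stop sending it reads until it catches up.
        slave.in_rotation = False
    elif not slave.in_rotation and bytes_behind == 0:
        # Caught up: put it back, but with 5% less load than before.
        slave.weight = slave.weight * (1 - WEIGHT_CUT)
        slave.in_rotation = True
    return slave
```

A monitoring loop would call `apply_lag_rule` each time it gets a fresh lag reading for a slave. How the lag is actually measured is left out, since that depends on the database.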

Friday was a perfect day -- no outages or slowness. Haven't had one of those since the Friday prior to that. Saturday was pretty good, but had some slowage mid-day and a few db issues.

The future looks good, though ... as more users get clustered it'll just keep getting better.

Most annoying right now is we're dealing with faulty hardware left and right ... had one ethernet port on a web server go half bad (just switched it to the other port), had a processor die on another web server (but the other kept working), had a whole slave db server die (it always sucked... it's been nothing but trouble from day 1), and now we had a disk die on another slave db server (but the hot spare kicked in and the array rebuilt within a few minutes).

Still ... all these issues mean we get stressed out and have to call vendors and get replacements. Most troubling is the dead slave db server ... that's some extra CPU that'd be nice to have considering we pulled two slaves out to be cluster slaves. But never fear --- we're just throwing our backup machine into rotation. We normally keep it out so we don't get used to having it, and only throw it in when needed. We're also putting Jesus into slave rotation: since it's doing so much less lately, it can do more reads.

Request: Try to stay on topic. I get emailed all your replies, and I try to reply to people that have good questions, but I can't find the good questions if there are hundreds of jokes about naming a server Jesus. :-)