Lisa Phillips (lisa) wrote in lj_maintenance,

When it rains it pours

We've had a number of events occur lately that have led to downtime for journals and general slowness for the site. We wanted to take a moment to discuss a couple of the problems, and to let you know we're doing everything we can to fix these issues with the least possible impact on you.

The short, less geeky story is that we've run into some pretty major issues with a specific piece of hardware we have in several of our database servers, and we're replacing all of them as soon as we can.



Specifically, we've been using the LSI MegaRAID 4-channel SCSI RAID card with battery-backed cache. The performance of these cards has generally been very impressive: we started using them back in December and have since gotten a lot more out of our databases than we used to. Unfortunately, in some cases disks have been going bad much more often than they should, and some arrays have become corrupt when they shouldn't have, forcing us to restripe and rebuild. Thankfully, the vendor we work with (Silicon Mechanics) has been very responsive and is working closely with us to identify the problems and replace hardware when necessary. We've begun replacing these cards with a very similar Intel card that we're more confident will give us fewer problems. We use Intel SCSI RAID cards in other machines and have had no problems with those.

As for the power failure Monday night, that was the result of us drawing too much power in our cabinet, on both our primary and secondary circuits. We are having new power circuits installed tomorrow in all of our cabinets, and we're ordering power metering equipment to make sure we don't experience a failure like that again. We'll also be moving more equipment between cabinets so that if we ever lose all power in one cabinet, we won't lose a whole database cluster or service.


It's never our intention for users to experience any downtime. We've worked hard over the last year to make our site architecture as redundant as possible so we can have unexpected failures without impacting the site. Whenever we are dealing with that kind of situation, keeping the site up and your journals available is our highest priority.

We're very sorry for any inconvenience this has caused. We're working on these issues around the clock and will continue to keep you updated.