Lisa Phillips (lisa) wrote in lj_maintenance,

Update on servers, new hardware, Cartman outage

We've already begun upgrading all of our older database machines. To give you an idea of the capacity we expect to gain, our newest cluster (put up about a month ago) is handling twice the active user traffic of any of our older database clusters without any problems. This hardware has been performing so much better than the older servers that we've ordered parts to replace them all. Unfortunately we can't upgrade them fast enough, and the site gets slow whenever one of the older clusters is too active, so we're working as quickly as we can to upgrade them all.

This week the Cartman cluster is getting the face lift. The hardware we're replacing should enable this cluster to perform at the level of our newest cluster (Porkchop), giving us a lot more breathing room to move users from other older clusters to this one as we continue our upgrades.

Unfortunately we ran into a problem today while doing what should have been an unnoticeable part of the upgrade, and the Cartman database was unavailable while we worked to fix it. It is back up now, and we apologize for the unannounced outage.

Our replacement hardware is almost ready to go into production, though, so tomorrow (Wednesday) during the day we are going to switch this cluster over to the much faster server.

We don't foresee this causing much downtime for users, but we are announcing it now as a precaution. While we would generally wait until late at night to do work that might cause service interruptions, we feel the benefit of doing this work immediately outweighs the risk of a temporary outage.

Not only are we retrofitting these servers with much better hardware, but each cluster being upgraded is also being set up in a master-master configuration. This means we can take down one master database server for maintenance while the other remains online, with no noticeable difference to users. Being able to perform maintenance more routinely without disrupting you puts us in a better position to prevent outages in the future.
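For the curious, here is a rough sketch of the master-master idea in Python. It is only an illustration, not our actual code: the class names, the server names "cartman-a" and "cartman-b", and the routing rule are all made up for the example, and it leaves out the replication that keeps the two masters in sync.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    name: str            # hypothetical server name
    online: bool = True


@dataclass
class Cluster:
    name: str
    masters: list = field(default_factory=list)

    def _find(self, master_name):
        """Look up a master by name; raise KeyError if it is unknown."""
        for m in self.masters:
            if m.name == master_name:
                return m
        raise KeyError(master_name)

    def pick_master(self):
        """Return the first master that is online; raise if none is."""
        for m in self.masters:
            if m.online:
                return m
        raise RuntimeError(f"cluster {self.name} has no master online")

    def begin_maintenance(self, master_name):
        """Take one master offline, refusing if it is the last one online."""
        target = self._find(master_name)
        online_count = sum(1 for m in self.masters if m.online)
        if target.online and online_count <= 1:
            raise RuntimeError("refusing to take down the last online master")
        target.online = False

    def end_maintenance(self, master_name):
        """Bring a master back online after maintenance."""
        self._find(master_name).online = True


# Assumed example: one cluster with two masters.
cluster = Cluster("cartman", [Master("cartman-a"), Master("cartman-b")])
cluster.begin_maintenance("cartman-a")
serving = cluster.pick_master().name   # traffic goes to the other master
cluster.end_maintenance("cartman-a")
```

While "cartman-a" is down for maintenance, requests in this sketch are served by "cartman-b", and the cluster refuses to take down its last online master.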

In addition to all the work being done on existing clusters, we're ordering another brand new cluster to ensure we stay ahead of our growth.
