bt (dwell) wrote in lj_maintenance,

Recent Website Problems and Solutions

I wanted to give you an update on the frustrating problems we've been experiencing over the last 3 to 4 weeks. If you've run into timeouts, blank pages, or journals, comment threads or individual entries not loading at all -- or not looking like they should (i.e., without style sheets) when they *finally* do load -- then this will be of interest to you.

The cause, put simply, is an overloaded database server. Here's what we plan to do about it within the next 5-7 days:
  1. Speed up the way our user information is pulled out of the database.
  2. Of the 10 servers where we store our data, the first one is really struggling to keep up with the demands on it. We plan to move some of the users on that first server elsewhere so that the requests for information are spread out more evenly.

Even if your journal, or the journal you are trying to look at, is on a different server, we're all still affected: everyone has to wait in line for their turn to get to their server, and right now the entire line is backed up.



First, we are fixing some of our SQL queries on our User Cluster databases.
We found out that one of the most frequent requests (up to 40%) was NOT configured to go through our Memcache tier, meaning every one of those SELECTs was hitting the database. A fix has been committed to the next release, which is scheduled to go out in 2 days.
There are also some other miscellaneous query optimizations which should decrease the time the database has to spend answering requests.
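
For the curious, here is roughly what "going through the Memcache tier" means. This is only a minimal sketch of the general look-in-the-cache-first pattern, not our actual code: the class name, the key format, and the fake database lookup are all made up for illustration, and a plain dict stands in for the cache.

```python
class UserInfoCache:
    """Check the cache first; only fall through to the database on a miss."""

    def __init__(self, cache, load_from_db):
        self.cache = cache                # dict-like store (hypothetical stand-in for the cache tier)
        self.load_from_db = load_from_db  # slow lookup (hypothetical stand-in for a SELECT)
        self.db_hits = 0                  # how many times we had to go to the database

    def get_user(self, user_id):
        key = f"user:{user_id}"           # key format is an assumption for this sketch
        row = self.cache.get(key)
        if row is not None:
            return row                    # cache hit: the database is never touched
        self.db_hits += 1                 # cache miss: pay for one database query
        row = self.load_from_db(user_id)
        if row is not None:
            self.cache[key] = row         # remember the answer for the next request
        return row


def fake_db_lookup(user_id):
    # Pretend database query; returns a made-up row.
    return {"id": user_id, "name": f"user{user_id}"}


users = UserInfoCache({}, fake_db_lookup)
for _ in range(5):
    users.get_user(42)
print(users.db_hits)  # 1: only the first of the five requests reached the database
```

Without the cache step, all five requests in that loop would have hit the database, which is the situation the frequent request described above was in.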

Second, we're moving the biggest community off of our overloaded database server and onto the smallest database. Then, in typical operations fashion, we're going to upgrade the hard drives in that database from good ol' SATA drives to something less... spinny. :D mhwest is on the phone right now trying to arrange the purchase and shipment details; I'm shooting to have both the upgrade and the move finished this coming Monday.
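
As a rough illustration of the move itself, the idea is to pick the biggest community on the busiest server and send it to the least-loaded one. The sketch below is hypothetical: the function name, the cluster names, and every number in it are invented for the example, and the real decision involves more than comparing two counts.

```python
def plan_move(cluster_load, communities):
    """Return (community, source, target) for the move, or None if there is nothing to move.

    cluster_load: {cluster_name: load}             (made-up load figures)
    communities:  {community_name: (cluster, size)} (made-up placement and sizes)
    """
    source = max(cluster_load, key=cluster_load.get)  # busiest cluster
    target = min(cluster_load, key=cluster_load.get)  # least-loaded cluster
    on_source = {
        name: size
        for name, (cluster, size) in communities.items()
        if cluster == source
    }
    if not on_source or source == target:
        return None
    biggest = max(on_source, key=on_source.get)       # largest community on the busy cluster
    return biggest, source, target


# Entirely invented data, for illustration only.
load = {"cluster1": 900, "cluster2": 300, "cluster3": 120}
placement = {
    "big_comm": ("cluster1", 400),
    "small_comm": ("cluster1", 50),
    "other_comm": ("cluster2", 200),
}
print(plan_move(load, placement))  # ('big_comm', 'cluster1', 'cluster3')
```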

And finally, we're going to have other people with fresh eyes and different experience and skill sets take a look at parts of our systems to see how we can improve things. This work is scheduled to start in 2 weeks and will be ongoing; the changes, if any, will be rolled out over the medium to long term.



If any of our plans change, I will update THIS post with the edits. However, if we need to take a site-wide maintenance window for the hardware update or user move, we will create an entirely NEW entry for that specifically.