bt (dwell) wrote in lj_maintenance,

Post-move update

We completed our data center migration on November 18, 2008, but the work hasn't ended. All of us have been impacted by problems after the move, such as delayed email notifications, site uptime problems and, well, I won't outline ALL of them here but you can check http://www.livejournal.com/support/ for the "Big Blue Box" (of despair! No, I jest... slightly). The only good news is that almost every problem is a "new" one; we're not seeing the same root causes over and over again.

I just wanted to let you know that the ops and engineering crew at LiveJournal feel your pain, probably in slightly different ways than you, for slightly different reasons. I'm not going to pretend to even start to comprehend what LJ means to you personally, but we're LJ users too. I've found real friends through LJ, found it a source of release during my good times and bad times, and have discovered endless hours of lulz, information and insight in these pages, from people just like you.

In short, you don't like downtime, and we don't like it either. I'm very proud of the dedication and ability of the support, ops and engineering teams here (I respect all our groups here but since this is lj_maintenance); the recent results for site uptime, though... let's just say we've got a lot of room to improve and lots and lots of ways to make you happy with us. Understatement. And we're continuing to work through our growing and learning pains, involve those who know the intricacies of this site better than we do and... not give up. And I hope you don't give up on us either.


Enough talk, here's what the operations group (with help from a whole lot of smart people) is working on right now.


1) Due to our data center move and IP renumbering, a lot of the mail providers are looking at us suspiciously. This is not their fault. Due to the amount of UCE ("spam") nowadays, they have to protect their users (you). Some of the bigger ones have throttled the rate of email LiveJournal sends them. This has caused our Schwartz queues to back up. This delays emails for EVERYONE. In our efforts to work the queues -- which are not just for email but also for other notifications and events -- we've overloaded our Schwartz database a couple of times. When the Schwartz database chews on its own fat for too long, it hangs our web servers. Hence, downtime. Weird, right?
-- We're working on the emails RIGHT NOW. Nick, the man, has been wrangling the queues with Abe and Dormando's help all morning, and we're hoping to get the number of emails waiting to be sent out from around 300,000 down to zero. It'll take many hours though, and it's a balancing act, so unfortunately I cannot give an ETA on this. It was over 600,000 earlier this morning!

2) We're trying to improve our network resiliency and perhaps even get better connectivity to non-USA locations. Our 2nd internet provider started announcing our route earlier today, which caused a loss of connectivity for anyone going through them. Which is weird because (1) I was prepending the hell out of our ASN through them, and external route servers like route-views.oregon-ix.net showed it was correctly making it through and (2) even if routing was asymmetric, hey, it's the internet, TCP/IP deals with it. I tried to troubleshoot it for a good 5-10 minutes and eventually just withdrew our announcement through the 2nd provider for now.
-- Will have to revisit this, though I'm very hesitant to do it until the rest of the stuff stabilizes. Look for an announcement for some network maintenance in mid-December. Worst case scenario -- which I don't like at all -- is that I manually fail over (i.e., start announcing) in the case of problems with the primary path.
-- 2 more providers will be coming up within the next 2 months.
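
For anyone curious why item 1 is a "balancing act" with no ETA, here is a rough back-of-the-envelope sketch in Python. It is not how the Schwartz workers actually run; the `drain_hours` helper and every rate below are made-up assumptions, and only the 300,000 backlog comes from this post. The point is just that the queue shrinks only as fast as providers accept mail minus whatever new notifications keep arriving.

```python
import math

def drain_hours(backlog, accepted_per_hour, new_per_hour=0):
    """Whole hours until the backlog reaches zero, or None if it never does.

    backlog           -- messages currently waiting in the queue
    accepted_per_hour -- what the (throttling) mail providers let us deliver
    new_per_hour      -- fresh notifications arriving while we drain
    """
    if backlog <= 0:
        return 0
    net = accepted_per_hour - new_per_hour
    if net <= 0:
        return None  # arrivals keep pace with delivery: the queue stays backed up
    return math.ceil(backlog / net)

# Hypothetical rates, NOT real LiveJournal figures.
print(drain_hours(300_000, accepted_per_hour=50_000, new_per_hour=20_000))  # 10
print(drain_hours(300_000, accepted_per_hour=20_000, new_per_hour=20_000))  # None
```

Pushing harder on the delivery side is not free either, since the workers lean on the same database that has already been overloaded a couple of times.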
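
And for item 2, a toy illustration of what "prepending our ASN" is meant to do. This is a sketch under heavy assumptions, not our router config: the AS numbers come from the range reserved for documentation, the helper names are invented, and the comparison only looks at AS-path length. Real routers check other things first (local preference, for one), so prepending is a hint to the rest of the internet, not a guarantee.

```python
OUR_ASN = 64500         # documentation-range ASN, not LiveJournal's real one
PRIMARY_ISP = 64496     # hypothetical provider ASNs
SECOND_ISP = 64497

def announce(own_asn, prepends=0):
    """AS path we hand to a provider: our ASN, plus `prepends` extra copies."""
    return [own_asn] * (1 + prepends)

def via(provider_asn, path):
    """The provider puts its own ASN in front when passing the route on."""
    return [provider_asn] + path

def shortest_as_path(candidates):
    """Pick the route with the fewest AS hops (ties go to the first listed)."""
    return min(candidates, key=lambda name: len(candidates[name]))

candidates = {
    "primary": via(PRIMARY_ISP, announce(OUR_ASN)),             # 2 hops
    "second": via(SECOND_ISP, announce(OUR_ASN, prepends=3)),   # 5 hops
}
print(shortest_as_path(candidates))  # primary
```

With the padded path, a remote network comparing only path length should keep using the primary provider and treat the second one as a backup.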


I'm opening this up to comments as lj_maintenance used to be, but I hope you understand we won't be able to personally address or answer all your comments; it's cuz we're working on the site! :D