I just wanted to let you know that the ops and engineering crew at LiveJournal feel your pain, probably in slightly different ways than you, for slightly different reasons. I'm not going to pretend to even start to comprehend what LJ means to you personally, but we're LJ users too. I've found real friends through LJ, found it a source of release during my good times and bad times, and have discovered endless hours of lulz, information and insight in these pages, from people just like you.
In short, you don't like downtime, and we don't like it either. I'm very proud of the dedication and ability of the support, ops and engineering teams here (I respect all our groups here, but this is lj_maintenance). The recent results for site uptime, though... let's just say we've got a lot of room to improve and lots and lots of ways to make you happy with us. Understatement. And we're continuing to work through our growing and learning pains, involve those who know the intricacies of this site better than we do and... not give up. And I hope you don't give up on us either.
Enough talk, here's what the operations group (with help from a whole lot of smart people) is working on right now.
1) Due to our data center move and IP renumbering, a lot of the mail providers are looking at us suspiciously. This is not their fault. Due to the amount of UCE ("spam") nowadays, they have to protect their users (you). Some of the bigger ones have throttled the rate of email LiveJournal sends them. This has caused our Schwartz queues to back up. This delays emails for EVERYONE. In our efforts to work the queues -- which are not just for email but also for other notifications and events -- we've overloaded our Schwartz database a couple of times. When the Schwartz database chews on its own fat for too long, it hangs our web servers. Hence, downtime. Weird, right?
-- We're working on the emails RIGHT NOW. Nick, the man, has been wrangling the queues with Abe and Dormando's help all morning and we're hoping to get the number of emails waiting to be sent down from around 300,000 to zero. It'll take many hours though, and it's a balancing act, so unfortunately I cannot give an ETA on this. It was over 600,000 earlier this morning!
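For the curious, here's a rough sketch of why draining the queue is a balancing act. This is NOT our actual Schwartz code; the function name, the domains and the per-tick limits are all made up for illustration. It just shows the idea that mail for a throttled provider has to go back in line while everything else keeps flowing.

```python
from collections import deque


def drain(queue, per_tick_limit, ticks):
    """Send queued mail without exceeding each provider's per-tick allowance.

    queue holds destination domains, one entry per waiting message.
    Returns the number of messages sent.
    """
    sent = 0
    for _ in range(ticks):
        budget = dict(per_tick_limit)      # fresh allowance every tick
        for _ in range(len(queue)):        # visit each waiting message once
            domain = queue.popleft()
            if budget.get(domain, 0) > 0:
                budget[domain] -= 1
                sent += 1                  # pretend the message was delivered
            else:
                queue.append(domain)       # throttled: put it back and move on
    return sent


# Hypothetical backlog: one throttled provider, one that accepts everything.
backlog = deque(["big.example"] * 5 + ["small.example"] * 2)
limits = {"big.example": 2, "small.example": 10}
print(drain(backlog, limits, ticks=1), "sent,", len(backlog), "still waiting")
```

In this toy run the throttled provider only takes two messages per tick, so its backlog sticks around even though the other provider's mail goes straight out. Scale that up to hundreds of thousands of messages and you can see why we can't just open the floodgates.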
2) We're trying to improve our network resiliency and perhaps even get better connectivity to non-USA locations. Our 2nd internet provider started announcing our route earlier today, which caused a loss of connectivity for anyone going through them. Which is weird because (1) I was prepending the hell out of our ASN through them, and external route servers like route-views.oregon-ix.net showed it was correctly making it through, and (2) even if routing was asymmetric, hey, it's the internet, TCP/IP deals with it. I tried to troubleshoot it for a good 5-10 minutes and eventually just withdrew our announcement through the 2nd provider for now.
-- Will have to revisit this, though I'm very hesitant to do it until the rest of the stuff stabilizes. Look for an announcement for some network maintenance in mid-December. Worst case scenario -- which I don't like at all -- is that I manually fail over (i.e., start announcing) in the case of problems with the primary path.
-- 2 more providers will be coming up within the next 2 months.
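If "prepending" sounds like gibberish: it means repeating our own AS number in the route we announce through one provider so that path looks longer, and therefore less attractive, to the rest of the internet. Here's a minimal sketch of the idea. The AS numbers are from the range reserved for documentation (not our real ones), the function names are made up, and real BGP routers check other things, such as local preference, before they ever compare path lengths, so prepending is a hint rather than a guarantee.

```python
def as_path(provider_asn, own_asn, prepends=0):
    """AS path a remote network sees for a route learned via one provider."""
    return [provider_asn] + [own_asn] * (1 + prepends)


def preferred(routes):
    """Pick the route with the shortest AS path (one tie-breaker of many)."""
    return min(routes, key=lambda name: len(routes[name]))


OWN_ASN = 64496  # documentation-range ASN, not LiveJournal's real one
routes = {
    "primary": as_path(64500, OWN_ASN),
    "secondary": as_path(64501, OWN_ASN, prepends=3),
}
print(preferred(routes))
```

With three prepends on the second provider, a network that sees both announcements and decides purely on path length would send traffic via the primary.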
I'm opening this up to comments as lj_maintenance used to be, but I hope you understand we won't be able to personally address or answer all your comments; it's cuz we're working on the site! :D