Here is a good explanation of what has been up with LiveJournal, just so you and your friends are informed :) If you see someone complaining or asking about what's up with LJ, point them to this community, have them friend it, whatever.
First is a non-technical explanation, then a technical explanation...
For servers, we have one main webserver, and currently four other webservers to spread the hits across. The main webserver is supposed to spread the hits out, but it wasn't doing its job so well. There are some bugs in the OS and in the webserver software that have been messing it up. We got the person who wrote the load-balancing software we use to help out, and we have since fixed all but one major bug.
So, it used to be that the main webserver was slow, crashed every half an hour, and dropped connections. Then it was fast, crashed every half an hour, and dropped connections. Now, it's fast, doesn't crash anymore, and drops connections. We're working pretty hard on getting that last bug out of here :)
The technical explanation (sorry, I wrote a small novel):
Apache processes on kenny were getting stuck in the "W" (writing back to client) state, and never coming out of it. So we'd eventually max out the service at 350 apache processes or so, the processes would all hang, and everyone would be sad. So we set up some scripts to gracefully restart apache every half an hour. Half an hour became fifteen minutes, and then ten minutes... Kept getting worse... Kenny was also handling an awful lot of requests that were supposed to be load-balanced to the other machines, so kenny was really slow all the time.
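To make the "stuck in W" symptom concrete, here is a minimal sketch of how a watchdog could decide that a restart is due. It assumes you already have the scoreboard string that Apache's mod_status reports, where each character is one worker slot and "W" means the worker is sending a reply. The function names, the 90% threshold, and the idea of triggering on the scoreboard at all are assumptions for illustration; the scripts described above simply restarted on a timer.

```python
from collections import Counter


def count_worker_states(scoreboard: str) -> Counter:
    """Count how many worker slots are in each scoreboard state."""
    return Counter(scoreboard)


def needs_restart(scoreboard: str, max_workers: int = 350,
                  stuck_fraction: float = 0.9) -> bool:
    """Return True when most of the worker limit is sitting in the "W" state.

    max_workers mirrors the ~350 process ceiling mentioned in the post;
    stuck_fraction is a hypothetical threshold, not a value from the post.
    """
    writing = count_worker_states(scoreboard)["W"]
    return writing >= stuck_fraction * max_workers


# Example: 320 workers writing, 30 idle ("_" is the idle state in mod_status).
if needs_restart("W" * 320 + "_" * 30):
    print("most workers are stuck writing; time for a graceful restart")
```

A check like this only tells you when to restart; it does nothing about why the workers are stuck, which is what the rest of the story is about.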
We got Theo, the maintainer of mod_backhand, to help look through some logs, and I went searching through some logs as well. We found out that the apache processes were having connection problems in mod_backhand when trying to load-balance a request. They were erroring out and defaulting to handling the request locally. A small patch from Theo fixed this problem. Now kenny was fast, the service screamed, but apache still crashed.
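The fallback behavior explains why kenny ended up doing so much work itself. Here is a rough sketch of the pattern, not mod_backhand's actual code (that is a C module for Apache). The `dispatch` function, the `web2` host name, and the helper functions are all made up for illustration.

```python
def dispatch(request, pick_backend, forward, handle_locally):
    """Try to hand the request to another machine; on a connection
    error, quietly serve it on the local machine instead."""
    backend = pick_backend()
    try:
        return forward(backend, request)
    except OSError:
        # Connection errors (refused, reset, ...) are OSError subclasses.
        return handle_locally(request)


def broken_forward(backend, request):
    """Stand-in for a forwarding step that always fails to connect."""
    raise ConnectionRefusedError(f"cannot reach {backend}")


def local(request):
    """Stand-in for serving the request on the main webserver."""
    return ("kenny", request)


# If every forward fails, every request lands on the main webserver.
results = [dispatch(f"req{i}", lambda: "web2", broken_forward, local)
           for i in range(5)]
handled_locally = sum(1 for host, _ in results if host == "kenny")
```

The fallback keeps the site up, but it also hides the failure: nothing looks broken from the outside, the main machine just gets slower and slower.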
Theo and everyone else went back to work... Theo found out that mod_backhand was hanging again, this time while trying to read data from the moderator process (the process that collects information from the other machines and helps delegate work). He set some hard process timeouts in the code, and now the apache processes time out instead of hanging. Now apache doesn't crash, but people lose connections a lot. We're investigating the last bug now. Kenny's 90% idle, so if we can fix this, we'd better be set for a while :)
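For the curious, the "time out instead of hang" trade-off looks roughly like this. It is a sketch in Python using a plain socket pair to stand in for the link to the moderator process; how mod_backhand really talks to its moderator, and what timeout values Theo chose, are not stated above, so `read_with_timeout` and the numbers here are assumptions.

```python
import socket


def read_with_timeout(sock, nbytes, timeout_s):
    """Read up to nbytes from sock, giving up after timeout_s seconds.

    Returns the bytes read, or None if nothing arrived in time.
    Without the timeout, recv() would block forever on a silent peer.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None


# A connected pair of sockets: one end plays the worker, the other
# plays a moderator that may or may not answer.
worker_end, moderator_end = socket.socketpair()

# The moderator says nothing, so the worker gives up instead of hanging.
silent_result = read_with_timeout(worker_end, 16, 0.05)

# The moderator answers, so the worker gets its data.
moderator_end.sendall(b"load")
answered_result = read_with_timeout(worker_end, 16, 0.5)

worker_end.close()
moderator_end.close()
```

The worker no longer hangs, but the request it was serving still fails when the read times out, which matches what people are seeing: no more crashes, but lost connections.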