Brad Fitzpatrick (bradfitz) wrote in lj_maintenance

Good news!

I'll present the following good news in both English and Geek. Pick whichever version you understand.

English: The long-running problem we've had with our main database not accepting connections is finally solved, I believe. This has been responsible for probably 90% of our problems over the past month.

Geek: If your master database doesn't accept new connections, everything blocks, because nearly everything needs to write at one stage or another. A slave db can be dead and we just take it out of rotation. Not so with the master. So what I discovered was that mysqld was sometimes blocking while trying to reverse-resolve hostnames. Since that system call isn't thread-safe, mysqld grabs a mutex in the master thread which accepts new connections and resolves there, blocking all connections. This is despite passing --skip-name-resolve to mysqld, which we did weeks ago guessing this might be the issue. But our internal DNS is sound! We have it load balanced across two machines and it's always available on a virtual IP. I checked netstat -p, grep -v'ing away all the normal good crap, and noticed that whenever the master db was blocking, there was an open UDP connection (what's it mean to be open in a stateless protocol?) to 10.2.0.1 (our DNS VIP). So even though we told it not to resolve, it was resolving anyway. Screw you, mysqld. So I changed /etc/nsswitch.conf to just not do DNS at all. We don't need it anyway. Now it seems to be cool. I also noticed on the bigip (the load balancer) that the connection between jesus and the load balancer was just hanging there. I tried to get in contact with my bigip masta' friend, but he wasn't around. I'll ask him about this later, but for now DNS is forcibly off at two layers, since mysql doesn't obey.
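
For the curious, the two-layer fix boils down to a couple of one-liners. These are from memory, so treat them as a sketch of the general shape rather than a dump of our actual configs; the grep target is just the DNS VIP mentioned above.

    # /etc/nsswitch.conf -- take DNS out of hostname lookups entirely, so even a
    # resolver call that ignores the mysqld option can't reach 10.2.0.1
    hosts:      files

    # /etc/my.cnf -- the option-file spelling of --skip-name-resolve
    [mysqld]
    skip-name-resolve

    # the smoking gun: while the master was wedged, something like this showed
    # mysqld sitting on a UDP socket to the DNS VIP
    netstat -pun | grep 10.2.0.1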

English: One of our many read-only database servers is now using a new type of backend which lets visitors read from it at the same time it's updating from the main database.

Geek: We're using InnoDB on a slave now. It only took us 5 tries, 4 segfaults, and 2 bug reports to Sweden & Finland. But they fixed up their B-tree code, sent me a patch, I rebuilt a mysql-server deb, and the ALTER TABLE ... TYPE=INNODB on the whole database completed after a few hours. Kenny's now replicating in the slave pool like a champ. We're going to give it traffic pretty soon here and see how well it does. From everything we've read, it really kicks ass. The little bit of playing with it I did while testing amazed me. Go read innodb.com if databases are your thing. It's impressive. No more blocking selects during updates! Hooray.
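
If you want to do the same conversion at home, it's one statement per table. The table name below is a placeholder, not one of ours; we just ran the equivalent against every table on the slave and then checked that the type really changed.

    -- convert a table in place to InnoDB (old MySQL spelling; 'some_table' is a placeholder)
    ALTER TABLE some_table TYPE=INNODB;

    -- sanity check: the Type column should now read InnoDB
    SHOW TABLE STATUS LIKE 'some_table';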

English: Did operating system upgrades today. Things are more reliable now. Over the next week or so we'll be upgrading all the read-only databases to be faster, too.

Geek: All our important machines are now running new 2.4 kernels with the aa VM, not Rik's from the ac kernel. Linux 2.4 is starting to get stable! Hooray. We're running ext3 (which is now in 2.4.15-pre, without a patch!) on the webservers, where hardly any disk activity occurs. The slave dbs slowed down when we went to ext3, but that was because we were in data=ordered mode, not data=writeback. But screw filesystems! InnoDB can use a raw partition for its table space, so that's what we'll end up doing for the rest of the slave dbs once we convert them one by one to InnoDB.
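
For reference, both of those knobs are one-liners too. The device names, sizes, and mount point below are invented for illustration, not what's on our machines, so check the ext3 and InnoDB docs before copying anything.

    # /etc/fstab -- ext3 with metadata-only journaling (data=writeback) instead of
    # the default data=ordered, which is what was slowing the slave dbs down
    /dev/sda3   /var/lib/mysql   ext3   defaults,data=writeback   0  2

    # /etc/my.cnf -- hand InnoDB a raw partition for its tablespace, bypassing the
    # filesystem entirely; 'newraw' lets InnoDB initialize the partition, then you
    # change it to 'raw' and restart mysqld
    [mysqld]
    innodb_data_home_dir =
    innodb_data_file_path = /dev/sda4:10Gnewraw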
