Brad Fitzpatrick (bradfitz) wrote in lj_maintenance,

Post Mortem

The site was pretty much down earlier, due to two unrelated problems which started almost at once. Things are doing fine now, though.

The issues were:

1) our load balancer (running a new version, as of last week) ran into a problem. vendor was notified and things were resolved. normally, traffic would fail over to the redundant load balancer, but our redundant one is disconnected at this point because we kept the old software on it, in case we ran into problems with the new software and wanted to revert. the problem was that between versions, a default value changed (fin_wait_timeout on outbound NAT) and we were filling up tables. changing the value, as well as adding more source NAT addresses, more than fixes the problem.

2) one of our slave databases seems to have corrupted its index on one table, causing certain simple queries to stall. a quick index repair fixed it. (if anybody can find my old post where I describe in simple terms indexes vs. data, I'd appreciate it!)

Issue #1 won't happen again.
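
To make the table-filling in issue #1 concrete, here is a rough back-of-the-envelope sketch. It assumes each closed outbound connection holds its NAT entry until fin_wait_timeout expires, and that each source NAT address offers a fixed pool of ports. The function names and all the numbers are made up for illustration; they are not our real traffic figures or the vendor's actual defaults.

```python
# Rough model (an assumption, not the vendor's implementation): a closed
# connection keeps its NAT entry for fin_wait_timeout seconds, so the average
# number of lingering entries is rate * timeout (Little's law).

def nat_entries_in_use(closes_per_second: float, fin_wait_timeout_s: float) -> float:
    """Average number of NAT entries held by connections waiting out the timeout."""
    return closes_per_second * fin_wait_timeout_s


def nat_capacity(source_addresses: int, ports_per_address: int) -> int:
    """Simplified upper bound on simultaneous entries: one port per entry."""
    return source_addresses * ports_per_address


def table_fills_up(closes_per_second: float, fin_wait_timeout_s: float,
                   source_addresses: int, ports_per_address: int) -> bool:
    """True if lingering entries alone would exceed the available capacity."""
    in_use = nat_entries_in_use(closes_per_second, fin_wait_timeout_s)
    return in_use > nat_capacity(source_addresses, ports_per_address)


# Hypothetical numbers: a longer default timeout with a single source address...
before = table_fills_up(closes_per_second=2000, fin_wait_timeout_s=60,
                        source_addresses=1, ports_per_address=60000)

# ...versus a shorter timeout plus more source NAT addresses.
after = table_fills_up(closes_per_second=2000, fin_wait_timeout_s=10,
                       source_addresses=4, ports_per_address=60000)

print("fills up before the change:", before)
print("fills up after the change:", after)
```

Either knob helps on its own in this model: the timeout shrinks the number of lingering entries, and extra source addresses raise the ceiling. Turning both is why the fix has headroom to spare.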

Issue #2 happens from time to time, but we were notified (in a roundabout way) so it was easy to fix. We'll change our monitoring to make the notification more explicit.
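
For issue #2, here is a toy sketch of indexes vs. data. The data is the rows themselves; an index is a separate sorted structure derived from the rows that lets a lookup skip scanning everything. This is only an illustration with a made-up table and hypothetical helper names, not how our database stores things. In this toy a damaged index returns wrong answers rather than stalling, but the repair is the same idea: throw the index away and rebuild it from the data, which was never damaged.

```python
import bisect

# The data: the rows themselves, keyed by row id (hypothetical toy table).
rows = {
    1: {"user": "carol"},
    2: {"user": "alice"},
    3: {"user": "bob"},
}


def build_index(rows, column):
    """Derive a sorted list of (value, row_id) pairs from the data."""
    return sorted((row[column], row_id) for row_id, row in rows.items())


def lookup(index, value):
    """Binary-search the index; only correct if the index is really sorted."""
    i = bisect.bisect_left(index, (value,))
    ids = []
    while i < len(index) and index[i][0] == value:
        ids.append(index[i][1])
        i += 1
    return ids


# A healthy index finds the row.
index = build_index(rows, "user")
found = lookup(index, "bob")

# A corrupted index (entries out of order) misleads the binary search,
# even though the underlying rows are untouched.
damaged = [("carol", 1), ("alice", 2), ("bob", 3)]
missed = lookup(damaged, "alice")

# Repair: rebuild the index from the data.
repaired = build_index(rows, "user")
found_again = lookup(repaired, "alice")

print(found, missed, found_again)
```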