Thu, 02 Aug 2012

Explanation of today's downtime

Today, more or less all of our infrastructure fell over. This blog post is an attempt at explaining what went wrong and what we're doing to prevent it from happening again. I'd also like to apologise for the outage lasting much longer than I consider reasonable.

To understand some of the background, it's useful to look back a little, to when Intel, HP and Google together enabled us to buy some new servers to replace the aging existing ones. We have slowly moved servers into virtual machines, the latest being the Bugzilla frontend and backend, which moved into two new VMs. There are still a few old servers left, but we'll get rid of them in the not-too-distant future, or at least, that's the plan.

Today, two of those new machines failed. One, teodor, hosts cgit, annarchy and an admin VM. Its failure seems to have been caused by a kernel bug somewhere in KVM land. For some reason, the other primary server, lyle, also stopped responding at the same time. I'm not sure why that happened, since the logs suggest it was fine. All this happened this morning, European time. For various reasons, I did not have the iLO passwords, so until I could obtain them, there was no way for me to reset the systems, and due to timezones, this took a while. Once I got the passwords, rebooting the unresponsive systems was done quickly enough and everything recovered afterwards.

What are we doing to make sure this does not happen again? Primarily, more people now have the iLO passwords, meaning we should be able to respond much more quickly. We're also sharing contact information for the relevant people more widely, so if shit hits the fan again, getting hold of the right people will be easier.

tfheen
