Earlier this week, we learned that a huge number of websites had been knocked offline by an Amazon Web Services outage. Since then, Amazon has been investigating exactly what brought down its widely used platform, concluding that a simple typo caused the downtime.
It turns out that some of Amazon's S3 servers weren't running as well as they should have been, so an engineer went in to investigate and decided to take some of Amazon's billing servers offline. Unfortunately, one of the commands was entered incorrectly, shutting down a far larger set of servers and knocking services offline that should have remained running.
As for why it took so long for services to come back online, it turns out some of Amazon's systems hadn't been restarted “in many years”, so the process of getting everything back up and running took a while. The full post goes into much more detail on the technical ins and outs of the error and how it was fixed, but that is essentially the gist of it.
In an effort to ensure this doesn't happen again, Amazon has gone over some changes it is making: “We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
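To make the safeguard Amazon describes a little more concrete, here is a minimal sketch of the idea: refuse any removal that would take a subsystem below its minimum required capacity, and throttle even valid requests so capacity drains slowly. All names and numbers below are illustrative assumptions, not AWS's actual tooling.

```python
# Hypothetical sketch of a capacity-removal safeguard. Function name,
# batch size, and thresholds are assumptions for illustration only.

MAX_REMOVAL_PER_STEP = 2  # rate limit: remove capacity slowly, in small batches


def remove_capacity(active_servers: int, requested: int, minimum_required: int) -> int:
    """Return how many servers may safely be removed right now.

    Raises ValueError instead of silently proceeding when the request
    would take the subsystem below its minimum required capacity.
    """
    if requested < 0:
        raise ValueError("removal count must be non-negative")
    if active_servers - requested < minimum_required:
        raise ValueError(
            f"refusing removal: {active_servers} - {requested} would fall "
            f"below the minimum required capacity of {minimum_required}"
        )
    # Even a valid request is throttled so capacity drains gradually.
    return min(requested, MAX_REMOVAL_PER_STEP)


# A mistyped command asking for far too many servers now fails loudly
# instead of cascading into an outage.
try:
    remove_capacity(active_servers=100, requested=90, minimum_required=40)
except ValueError as err:
    print(err)

# A legitimate small removal is allowed, but capped at the batch size.
print(remove_capacity(active_servers=100, requested=5, minimum_required=40))
```

The point of the design is that an "incorrect input", as Amazon puts it, is rejected up front rather than executed, and that even correct inputs can only drain capacity a small batch at a time.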
KitGuru Says: While it was quite a surprise to see such a vast number of websites suddenly knocked offline this week, the incident appears to have highlighted some issues with Amazon's systems, which will now be improved.