The Great Amazon Web Services Outage of 2017 is behind us. Now, Jeff Bezos’ golden child is ready to explain what happened. Turns out, what took Giphy, Medium, Slack, Quora and a ton of other websites and services down was a typo. As Amazon explains it, the S3 billing system was running rather sluggishly, so a tech tried to fix it by taking a few of the servers behind it offline. A fix straight from the company’s playbook, it says. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” Whoops.
As for why the problem took so long to correct, Amazon says that some of its server systems haven’t been restarted in “many years.” Given how much the S3 system has expanded, “the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”
Amazon has apologized and promises to do better in the future. For starters, it says it has altered the at-fault tool (the code, not the employee) so that it removes capacity more slowly. Beyond that, it is adding safeguards to stop so many servers from being taken offline at once.
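For the curious, here is a rough sketch of what those two safeguards could look like in code. Amazon has not published its tool, so everything here is an assumption made for illustration: the `Fleet` class, the `plan_removal` function, and the `min_required` and `max_per_step` parameters are all hypothetical names. The idea is simply to refuse a request that would drop a pool below its minimum capacity, and to break any accepted request into small batches.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fleet:
    """Hypothetical pool of servers backing one subsystem."""
    active: int        # servers currently in service
    min_required: int  # assumed floor the subsystem needs to stay healthy


def plan_removal(fleet: Fleet, requested: int, max_per_step: int) -> List[int]:
    """Return the batch sizes for removing `requested` servers.

    Raises ValueError if the inputs are invalid or if the removal would
    take the fleet below its minimum required capacity.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step must be > 0")

    # Safeguard 1: reject a request that goes below the capacity floor,
    # for example one inflated by a mistyped input.
    if fleet.active - requested < fleet.min_required:
        raise ValueError("refusing: removal would drop below minimum capacity")

    # Safeguard 2: remove capacity slowly, in batches of at most max_per_step.
    batches: List[int] = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)
        batches.append(step)
        remaining -= step
    return batches
```

With a fleet of 100 servers and a floor of 80, asking to remove 5 servers two at a time yields three batches, while a fat-fingered request for 50 is rejected before anything goes offline.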