Amazon Web Services has fessed up to the cause of this week's large-scale outage in its US East 1 region, which impacted major global internet sites including Slack, Adobe, Pinterest and Airbnb.
The cloud giant revealed in a blog post that a large number of servers were accidentally removed when an employee entered an incorrect input while debugging a problem with the billing system for its Simple Storage Service (S3).
The staffer had been attempting to execute a command to remove a small set of servers for one of the S3 subsystems, but the mistyped input removed a much larger set of servers than intended.
The removed servers supported two other S3 subsystems, one of which was the index subsystem that manages the metadata and location information of all S3 objects in the region.
"This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests," AWS said in a blog post.
"The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects."
S3 services in the Northern Virginia region were unavailable while AWS performed a full restart of the affected systems. Services that rely on S3 for storage were also unavailable during the restart, including the S3 console, new EC2 instance launches, EBS volumes, and Lambda.
Exacerbating the problem, AWS had not done a full restart of the index and placement subsystems in several years, which caused the restart process to take longer than expected.
"S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
During the outage, AWS had to use its Twitter account to update customers on the remediation process because its Service Health Dashboard was down due to its own dependency on S3.
AWS has added safeguards to prevent the same blunder from happening again, including modifying its tooling so that capacity is removed more slowly and cannot be taken below a subsystem's minimum required level.
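The sketch below illustrates the general shape of such a guardrail: reject a capacity-removal request if it is too large in one go or would leave a subsystem under its capacity floor. The function, thresholds and subsystem names here are invented for the example and are not AWS's actual tooling.

```python
# Hypothetical guardrail for capacity-removal requests. All names
# and numbers are assumptions made for illustration only.
MIN_CAPACITY = {"index": 300, "placement": 150}   # assumed per-subsystem floor
MAX_REMOVAL_FRACTION = 0.05                       # assumed per-operation cap

def validate_removal(subsystem: str, current: int, to_remove: int) -> None:
    """Raise if a capacity-removal request is too aggressive."""
    if to_remove > current * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"refusing to remove {to_remove} servers at once from "
            f"'{subsystem}'; limit is {int(current * MAX_REMOVAL_FRACTION)}"
        )
    if current - to_remove < MIN_CAPACITY[subsystem]:
        raise ValueError(
            f"removal would drop '{subsystem}' below its minimum "
            f"capacity of {MIN_CAPACITY[subsystem]} servers"
        )

# Example: a fat-fingered request to pull 200 servers from the index
# subsystem is rejected before any capacity is touched.
try:
    validate_removal("index", current=400, to_remove=200)
except ValueError as err:
    print(f"blocked: {err}")
```

The point of such a check is that a single mistyped input fails loudly at validation time instead of silently taking a critical subsystem offline.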
The company used the blog post to "apologise for the impact this event caused for our customers".
"While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses," AWS said.
"We will do everything we can to learn from this event and use it to improve our availability even further."