AWS Sydney outage caused by failure in primary and backup power

Amazon Web Services has outlined the problems that caused the outage in its Sydney availability zone on Sunday night.

The outage, which affected Elastic Compute Cloud (EC2) and Elastic Block Store (EBS) instances, knocked out numerous big-name Australian websites, including Foxtel Play, Channel Nine, Presto and Stan.

In a blog post, AWS explained that each instance is powered by two independent sources: the main utility supply and a diesel rotary uninterruptible power supply (DRUPS), which stores energy and bridges the load until a backup generator starts if utility power falls over.

When utility power died during the massive storms that hit Sydney, the DRUPS shut down, meaning the backup generators could not be brought online.

"Normally, when utility power fails, electrical load is maintained by multiple layers of power redundancy. Every instance is served by two independent power delivery line-ups, each providing access to utility power, uninterruptable power supplies (UPS), and back-up power from generators. If either of these independent power line-ups provides power, the instance will maintain availability.

"During this weekend’s event, the instances that lost power lost access to both their primary and secondary power as several of our power delivery line-ups failed to transfer load to their generators."

Rather than a complete loss of utility power, AWS witnessed an “unusually long voltage sag”, which meant the breakers designed to isolate the DRUPS from the utility feed did not open quickly enough, and the DRUPS dumped its reserve energy into the degraded grid.

“The rapid, unexpected loss of power from DRUPS resulted in DRUPS shutting down, meaning the generators which had started up could not be engaged and connected to the data centre racks. DRUPS shutting down this rapidly and in this fashion is unusual and required some inspection,” according to the company.
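The redundancy model AWS describes boils down to a simple either-or check: an instance only goes dark if both of its independent power line-ups fail at once. The short Python sketch below illustrates that logic; the class and attribute names are purely illustrative and are not AWS's internal systems.

from dataclasses import dataclass

@dataclass
class PowerLineup:
    """One independent power delivery line-up: utility feed, UPS/DRUPS, generator."""
    utility_ok: bool
    ups_available: bool
    generator_engaged: bool

    def delivers_power(self) -> bool:
        # A line-up keeps serving load if the utility feed is healthy, the
        # UPS can bridge the gap, or the generator has been engaged.
        return self.utility_ok or self.ups_available or self.generator_engaged

def instance_has_power(lineup_a: PowerLineup, lineup_b: PowerLineup) -> bool:
    # An instance stays up if either of its two independent line-ups delivers power.
    return lineup_a.delivers_power() or lineup_b.delivers_power()

# Sunday's failure mode as described: the voltage sag stopped the breakers
# isolating the DRUPS, the DRUPS dumped its reserve and shut down, and the
# generators that had started could not be connected to the racks.
failed = PowerLineup(utility_ok=False, ups_available=False, generator_engaged=False)
print(instance_has_power(failed, failed))  # False -> instance loses power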

A "latent bug" in the instance management software meant some restoration times took longer than expected, while others had to be manually recovered.

Some EBS customers lost their data as a result – AWS said less than 0.01 percent of instance volumes were unable to recover after power was restored. A small number of hard drives also failed, rendering the data on them unrecoverable.

AWS said it would add extra breakers designed to "more quickly break connections to degraded utility power to allow our generators to activate before the UPS systems are depleted".

The cloud provider is also improving its recovery software, saying a fix for the issue that prevented customer instances from being automatically recovered will be deployed in the coming days.

“We apologise for any inconvenience this event caused. We know how critical our services are to our customers’ businesses. We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and use it to drive improvement across our services,” said AWS.

The outage lasted approximately 10 hours from Sunday afternoon to early Monday morning.
