Cloud need not fear data centre outages

By Mahesh Krishnan on Feb 9, 2015 6:49AM

The cloud is all about elasticity, on-demand capacity and cost savings, but it’s also multi-tenanted and built on commodity hardware, and is therefore prone to failure.

If you’ve deployed your application in the cloud and are offering a string of nines in your service-level agreements, this could mean trouble.

You may recall the outages at both Microsoft Azure and Amazon Web Services towards the end of last year, which affected many customers.

Let’s be clear: I see Microsoft and AWS as the only genuine public cloud infrastructure providers in Australia. Everything else is prohibitively expensive and not genuine cloud. What do you need to do to your application to survive a data centre outage, when even the big guys can go down?

Rather than avoid failure, architects and developers need to ensure their applications embrace failure and deal with it.

First, you need to model the application to understand the different workloads the application deals with, what the load life cycle of the application will be, what its SLA will have to be and at what points failure can occur.

Once these have been identified, you can apply several patterns while building the application. These patterns loosely fit into the three Rs: resiliency, redundancy and recovery. They include doing things asynchronously, which gives you operational autonomy, applying time-outs, and using the circuit-breaker pattern.
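To make the circuit-breaker idea concrete, here is a minimal sketch in Python. The class name, the thresholds and the injectable `clock` parameter are all assumptions chosen for illustration; they do not come from any particular library. The wrapped function is expected to apply its own time-out, so that a hung dependency surfaces as an exception the breaker can count.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Illustrative breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock                          # injectable so it can be tested
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the design is that once a dependency is known to be failing, callers stop queuing up behind it and can move straight to whatever fallback they have.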

You also need to ensure that when failures occur, you have one or more strategies in place to handle them, such as doing retries, and having graceful degradation so that certain parts of the application continue to operate while others don’t.
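The retry and graceful-degradation strategies above can be sketched together. This is only an illustration: the function name, the default delays and the idea of passing a `fallback` callable (for example, one that returns cached data) are assumptions, not a prescribed implementation.

```python
import random
import time


def call_with_retries(operation, fallback, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try `operation` a few times with exponential backoff, then degrade to `fallback`."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                break  # out of attempts; fall through to the degraded path
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps many clients from retrying in lock-step.
            sleep(delay + random.uniform(0, delay))
    # Graceful degradation: serve something useful rather than an error.
    return fallback()
```

A caller might pass a live database query as `operation` and a lookup in a local cache as `fallback`, so that part of the application keeps working while the dependency recovers.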

Eliminating all single points of failure and building in redundancy are crucial for attaining multiple nines in your SLA. You can start by running multiple instances and duplicating data within the same data centre but, eventually, you will have to think about how to replicate this model across multiple data centres, or even across different cloud providers.
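A rough calculation shows why redundancy matters for the number of nines. The sketch below assumes that replicas fail independently, which is optimistic: instances in the same data centre often share power, network and software faults, which is exactly why the model eventually has to span data centres. The function names are illustrative.

```python
def combined_availability(single, replicas):
    """Availability of N redundant replicas, ASSUMING independent failures.

    The system is down only when every replica is down at the same time.
    """
    return 1 - (1 - single) ** replicas


def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1 - availability) * 365 * 24


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} replica(s): {a:.6f} available, "
              f"{downtime_hours_per_year(a):.2f} hours down per year")
```

Under that independence assumption, two replicas that are each 99 per cent available combine to about 99.99 per cent; correlated failures will make the real figure lower.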

A very simple design for a web application running on Microsoft Azure could be to run the website in multiple data centres with SQL databases replicated across them, split the reads and writes from the web application, use patterns such as Queue-Centric Workflows and, finally, use Traffic Manager to route traffic to the right website.
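The queue-centric idea mentioned above can be illustrated with a small sketch. It uses an in-memory `deque` purely as a stand-in for a durable cloud queue; the `WorkQueue` class and its methods are hypothetical and are not the API of any Azure service. The web tier enqueues work and returns immediately, and a worker removes a message only after handling it successfully.

```python
from collections import deque


class WorkQueue:
    """In-memory stand-in for a durable queue between web tier and worker."""

    def __init__(self):
        self._messages = deque()

    def enqueue(self, message):
        # The web tier only records the request; it does not wait for the work.
        self._messages.append(message)

    def process_next(self, handler):
        """Handle the oldest message; return False if the queue is empty."""
        if not self._messages:
            return False
        message = self._messages[0]   # peek without removing
        handler(message)              # if this raises, the message stays queued
        self._messages.popleft()      # delete only after the handler succeeded
        return True

    def __len__(self):
        return len(self._messages)
```

Because a failed message stays on the queue, a worker that crashes part-way through loses no work; the message is simply picked up again later. Handlers therefore need to tolerate being run more than once for the same message.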

Even in this simple design, an infrastructure failure can easily be absorbed, and the application made to switch over to the secondary. There may be some graceful degradation when things go dramatically wrong, but the application continues to run and is soon back to full service as if nothing had gone down.
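The switch to a secondary can be pictured as priority-based routing over health-checked endpoints. The sketch below is a conceptual illustration only: the `Endpoint` type, its fields and `pick_endpoint` are hypothetical names, and this is not how Traffic Manager is configured or implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int   # lower number means preferred
    healthy: bool   # in practice this would come from a periodic health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the best priority, or None if all are down."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.priority)
```

With a primary and a secondary site, traffic goes to the primary while it is healthy and moves to the secondary as soon as the primary's health check fails, then returns once the primary recovers.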

Netflix is a great example of resilient architecture, running all of its infrastructure in the cloud. The company uses Chaos Monkey, which deliberately and repeatedly breaks instances in its production environment, forcing its engineers to build applications that are resilient to failure.

Netflix proves that if you architect your applications for resilience, you can nullify outages that affect individual data centres, and still offer customers continuous uptime.

Mahesh Krishnan is head of product and client services at Readify, headquartered in Melbourne. He recently gave a talk with John Azariah at NDC London on building resilient architectures for the cloud.
