Avoid outages and achieve the fabled 'four nines'


If a company’s website were unavailable for five minutes each year, or even just slowed down significantly, what impact would that have on the business? What about eight hours of downtime: what would that do to its revenues?

These downtime figures might seem arbitrary, but they are exactly what a cloud provider is referring to when it advertises 99.9 percent availability, or the hallowed ‘five nines’ of 99.999 percent, for its services: roughly eight hours and about five minutes of downtime a year, respectively. The actual outage periods a customer’s website experiences in a given year could be far greater, especially when accounting for service issues or faults during critical periods, such as marketing campaigns.
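The arithmetic behind those figures is simple to check. The sketch below (plain Python, assuming a 365-day year) converts an advertised availability percentage into the downtime it actually permits per year:

```python
# Convert an advertised availability percentage into the downtime it
# permits per year. Assumes a 365-day year; leap years add a little.

SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability_pct):
    """Seconds of downtime allowed per year at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    minutes = downtime_per_year(pct) / 60
    print(f"{pct}% uptime allows about {minutes:,.1f} minutes down per year")
```

At 99.9 percent that works out to roughly 8.8 hours a year; at five nines it is just over five minutes, which is where the figures in the opening paragraph come from.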

But it is possible for clients to ensure online services and websites are available all of the time: 100 percent. This doesn’t mean keeping every single compute or web node online at all times. Instead, architecting services for 100 percent uptime means that, even if an entire cloud availability zone or data centre fails, end user customers will not be impacted and the website will remain fully visible.

Too few businesses fully appreciate the potential impact these failures might have. Even some businesses that live or die on web services fail to prepare for a massive cloud outage. 

The announcement in late 2014 that Amazon would perform rolling, minutes-long reboots of its cloud computing service in order to issue a security patch, although unexpected, provided a perfect example of why customers should consider the impact of downtime. In the case of AWS, we planned ahead to ensure each client had at least one web node running throughout the reboots, so their services remained available.

In consultative engagements with our customers, we often ask what they are trying to protect against, what the potential impact of any downtime would be, and how much of a risk buffer they need. For some, the threat of lost revenue is most important: every second of downtime sends revenue and potential sales down the drain. For others, the potential impact an outage could have on their brand reputation is what matters most.

To avoid either disaster scenario, businesses need to architect for failure and test those failure plans. This means something wholly different from the backup or disaster recovery systems that many businesses are used to. In the old world of data centres, businesses could have one active and one passive node. If the active node failed, it was standard, and often acceptable, to take four to six hours to get back up and running.

In the cloud, architecting for high availability means that any failover between an online service’s primary and secondary nodes must happen in a matter of milliseconds, unnoticeable to end user customers. Both nodes must effectively be online and active at all times, rather than one being turned off, or passive.
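The client side of that active-active pattern can be sketched in a few lines: try the first node, and if it fails, fall over to the next within a sub-second timeout rather than waiting for a slow recovery process. This is a minimal illustration; the endpoint URLs and the timeout value are invented for the example, not any particular provider's API.

```python
# Minimal sketch of client-side failover between active-active nodes.
# Each node is wrapped in a zero-argument callable with a short timeout,
# so a dead primary costs well under a second before the secondary answers.
import urllib.request

def first_success(fetchers):
    """Call each fetcher in turn; return the first result that doesn't
    raise, or raise if every node is unavailable."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch()
        except OSError as err:   # timeout, refused connection, DNS failure
            last_error = err     # fall through to the next node
    raise RuntimeError("all nodes unavailable") from last_error

def http_fetcher(url, timeout=0.5):
    """Wrap a URL in a callable suitable for first_success()."""
    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    return fetch

# Usage (hypothetical endpoints):
# body = first_success([
#     http_fetcher("https://primary.example.com/"),
#     http_fetcher("https://secondary.example.com/"),
# ])
```

In production this logic usually lives in a load balancer or DNS health check rather than in the client, but the principle is the same: both nodes are live, and the switch is automatic and near-instant.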

So how do you reach that Holy Grail of 100 percent uptime? By making sure every single piece of a customer’s service stack is highly available, and duplicated so that no single failure can bring down the entire stack.

This means looking beyond the obvious – like web nodes and databases – and ensuring everything required to keep that service afloat is architected for failure: the load balancers, the domain name system (DNS) servers, the content delivery network. These are the things that, even in recent months, have led to service outages because businesses didn’t consider them potential weak links.

Customers need to be warned never to underestimate or overlook what might seem a minor part of a deployment. Having multiple web nodes but a single database server, or a poorly configured DNS service, can lead to trouble.
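The maths bears that warning out. Components that all have to be up multiply their availabilities together, while n redundant copies of a component are down only if every copy is down, giving 1 - (1 - a)^n. A rough model, with illustrative figures rather than anyone's real numbers:

```python
# Rough availability model: components in series must all be up, so
# their availabilities multiply; redundant copies of a component fail
# only if every copy fails. All figures below are illustrative.

def serial(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(a, n):
    """Availability of n redundant copies of one component."""
    return 1 - (1 - a) ** n

web_pair = parallel(0.999, 2)                 # two 99.9% web nodes
print(serial(web_pair, 0.999))                # ...behind a single database
print(serial(web_pair, parallel(0.999, 2)))   # ...behind a database pair
```

The single 99.9 percent database caps the whole stack at roughly 99.9 percent no matter how many web nodes sit in front of it; duplicating the database lifts the stack to around 99.9998 percent. The weakest serial link always wins.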

Having a plan that is automatically implemented in case of failure, by duplicating these elements – not just across multiple data centres, but also multiple countries if possible – is the best way to get to that 100 percent mark. For many companies, this is about weighing up the commercial viability of additional investment in cloud, and their bang for buck, against the potential for harm in the event of an outage.

If a business is doing tens of thousands of dollars per month through an e-commerce site, but paying for an overseas-hosted shared web environment, they should be prepared to face some downtime. 

Challenges remain, and success requires creative thinking. For example, depending on the application, ensuring that the business is adequately licensed to run two or more instances of each application is an important consideration, particularly for older applications that were not built for a cloud environment.

Customers must also consider the need for a robust change control policy, to ensure each node is completely up to date, no matter what the IT team pushes into production. Having a test environment and significant user testing of the website before it is pushed live or updated significantly is also important.

The decision to aim for 100 percent uptime is not an easy one for every business; it is a matter of determining acceptable risk in the event of a cloud failure. Nor is it as simple as flicking a switch: it takes careful planning and consideration, accounting for the peculiarities of each company’s cloud environment.

For some businesses, it may become too expensive to make everything 100 percent available all of the time, but it is important that businesses recognise what aspects of their online services must be available to ensure they run smoothly and without fault.

But for those who truly want it, the Holy Grail is there for the taking. 


Breakout: Back to basics

  1. Customers must determine their risk profile in the event of a massive cloud outage, and must have a plan for failure.
  2. Consider whether the website really needs the fabled ‘five nines’ (99.999 percent) of availability.
  3. Weigh up the commercial viability of additional investment in cloud versus potential harm in the event of an outage.
  4. Architect for failure and test those failure plans. Will the website remain fully visible even if an entire cloud availability zone or data centre fails?
  5. Never underestimate minor elements. For instance, having multiple web nodes but a single database server or poorly configured DNS service may lead to trouble.
  6. Consider licensing issues in an active-active configuration. Is the business licensed to run two or more instances of each application?
  7. Work out your own risk – what would happen to your company and your customers if a cloud failure happens?

Gary Marshall is director of managed services at Bulletproof Group, which appeared in the 2012 and 2013 CRN Fast50 awards.

Copyright © nextmedia Pty Ltd. All rights reserved.