At 1:00am this past Tuesday, technicians at Melbourne IT-owned hosting company WebCentral detected a fault on what iTnews now understands to have been an IBM SAN (storage area network) controller.
It was the first in a series of system failures that tested the company’s capacity to deal with a crisis.
In the aftermath of the incident, Melbourne IT chief technical officer Glenn Gore has been left with the job of explaining what went wrong.
Gore conceded that Melbourne IT failed to adequately communicate with customers and that a 72-hour outage is simply not good enough.
But he insisted there was a good reason it took so long to restore service.
“We might have looked a bit slow in our response, but the integrity of customers’ email was absolutely the top priority. There was a lot of checking and rechecking of data,” he said. “To the best of our knowledge, we haven't lost any mail.”
The outage, blow-by-blow
Tuesday 1:00am
WebCentral technicians note that one of the controllers on a storage array has failed, creating instability in the mail platform.
The SAN in question supports a number of services, none larger than WebCentral’s shared email, and is storing some 20 terabytes of data.
Gore refused to divulge which vendor supplied the array that failed, but iTnews has since learned it was an IBM array.
Gore insisted it was not the same SAN blamed for a WebCentral web hosting outage a month earlier. That system is hosted in a separate data centre altogether, he said.
Tuesday 8:00am
The SAN is operational but not behaving properly. As users come online to check their mail, the mail platform (made up of eight front-end processing servers and ten back-end services connecting to some ten terabytes of mail data) cannot keep up.
By 9:30am customers are noticing problems with accessing their email.
The underlying storage unit fails. WebCentral appoints a “Critical Incident Manager” and an incident team to deal with the problem.
Tuesday afternoon
WebCentral technicians spend most of Tuesday focused on recovering the SAN, with the help of staff from the SAN vendor in question.
“It was effectively offline,” Gore says. “It took all day to get the SAN to a recoverable state.”
A flood of customer calls washes in, reporting connection issues.
Melbourne IT’s communication teams record a message informing customers of the problem, which is set to automatically play when customers call in. But the volume of calls is so large the recorded message feature stops working.
Melbourne IT estimates that only 25 per cent of callers ever heard the message.
WebCentral technicians work into the evening and are able to restore the SAN without having to revert to back-up data.
Wednesday 6:30am
A morning shift of technicians arrives at 6:30am to relieve the night shift, which stays on until between 8:00am and 9:00am to ensure a smooth changeover before the morning rush of email.
The team starts the mail system servers back up to look at the file systems and see what state they are in. All the servers successfully mount the file systems that hold the mail data.
Wednesday 8:00am
As the morning load of email comes on, the mail platform begins suffering from data corruption issues. Some of the back-end services are crashing and coming back online with data corruption errors.
Fearing the potential for data loss, WebCentral technicians take the system offline and begin analysing the file systems to check on the integrity of the data.
With the system offline, customers again have no access to mail.
Wednesday 12:00pm
By midday, WebCentral has lost 50 per cent of its 10 mail stores due to corruption.
The sheer size of the data stores in question made checking the integrity of the data a long and laborious process, Gore said.
“Your best case scenario would be for each of the three integrity checks [required] to take two to three hours per message store. You run the integrity check once without modification, which is two to three hours. You run it again with modifications enabled: another two to three hours. Then you run it for a third time to make sure the modification hasn't caused problems.”
“We had to run each of these checks five and six times per message store to make sure we cleared the corruption,” Gore said. “We didn't bring the servers back online until we could guarantee there was no corruption in a given mail store.”
These data integrity checks continue into the night and Thursday morning.
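For readers unfamiliar with the workflow Gore describes, the sketch below illustrates that style of multi-pass integrity check in generic terms: an inspect-only pass, a repair pass, then a verification pass, repeated until a store comes up clean. It assumes an fsck-style checker (fsck.ext4 here) and hypothetical device names; the actual tooling and commands WebCentral ran against its mail stores are not disclosed.

```python
# Hypothetical sketch only: a three-pass, repeat-until-clean integrity check
# of the kind Gore describes. Tool name, flags and device paths are assumptions.
import subprocess
import sys

CHECK_TOOL = "fsck.ext4"   # assumed fsck-style checker; WebCentral's tooling is not disclosed
MAX_ROUNDS = 6             # some stores reportedly needed five or six rounds

def run_check(device: str, repair: bool) -> int:
    """Run one pass over a mail-store volume; return the checker's exit code (0 = clean)."""
    flags = ["-p"] if repair else ["-n"]   # -n: inspect only, no changes; -p: automatic repair
    return subprocess.run([CHECK_TOOL, *flags, device]).returncode

def verify_store(device: str) -> bool:
    for round_no in range(1, MAX_ROUNDS + 1):
        if run_check(device, repair=False) != 0:   # pass 1: inspect only (hours per store)
            run_check(device, repair=True)         # pass 2: apply repairs
        if run_check(device, repair=False) == 0:   # pass 3: confirm the repairs held
            print(f"{device}: clean after round {round_no}")
            return True
    return False

if __name__ == "__main__":
    # Usage (illustrative): python verify_stores.py /dev/mapper/mailstore01 /dev/mapper/mailstore02
    failed = [dev for dev in sys.argv[1:] if not verify_store(dev)]
    sys.exit(1 if failed else 0)
```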
Thursday 9:00am
By the time the morning rush hits on Thursday, WebCentral is confident the SAN and mail stores are 100 per cent back online.
At 9:00am customers are able to access some of the backlog of emails, but by 9:30am connection issues set in from the large number of customers trying to access accounts.
Gore claims WebCentral maintains 50 per cent excess capacity across its SAN and server processing to “soak up spam attacks” and the like. But with two days of queued emails coming back online, the system can’t handle the load.
“Mail systems were trying to deliver two days of queued mail,” he said. “Plus we had customers trying to access their mail on clients.”
The mail platform behaves inconsistently up until lunchtime.
Friday 12:40am
By 12:40am there is a “big improvement in the behaviour of the mail platform.”
WebCentral engineers make a few tweaks to the routing on their network to improve speeds. But there are still intermittent errors – due mostly to the volume of queued mail.
“By this time customers had been down a couple of days,” Gore said. “Many had changed the settings on their client looking for a workaround.”
WebCentral attempted to advise a small subset of customers to revert their settings back to normal to begin receiving mail again.
Technicians spent the remainder of Thursday night making minor changes and communicating with customers.
Friday 9:00am
Gore and his team believe there to be no further problems with the mail platform, but hold off officially calling it resolved until well into the afternoon, paying close attention to how it handles the morning load.
Customers report the system is working as per normal.
Next steps
“We understand this was a large outage,” Gore says. “We have identified a number of changes we will make.”
The first step, Gore said, is to improve the means by which WebCentral can communicate with customers during an outage.
It will build a new ‘service status page’ on its website offering greater detail.
There will also be a new phone system installed to handle higher volumes.
And while he won’t get too specific, Gore said the company will make “significant changes to the way we deploy and manage our storage environment.”
“When you buy hardware for shared hosting, it’s hard to get the hardware vendors to agree to SLAs [service level agreements],” he said. “They know there is a flow-on effect to thousands of customers and that the infrastructure it supports changes dramatically over time.”
Gore said he has heard the numerous calls from affected customers for some kind of compensation, a matter that has been referred to Melbourne IT’s board.