Australian cloud provider Ninefold has been unable to pinpoint the exact cause of a five-hour outage last week that forced it to disable virtual server provisioning in its cloud environment.
The outage started at 9.20am on Thursday February 16 and was resolved at 2.05pm the same day, according to a post-incident report obtained by iTnews.
The report initially pinned the incident on "an unexpected failure" of an NFS server that "automatically restarted itself and resumed normal operation within minutes".
"A number of physical host servers (which support customer VMs) were performing operations to the NFS server at the time of the unexpected Network File System server failure," the report stated.
"Due to the nature of the NFS server failure, this caused provisioning on these specific physical host servers to become unresponsive.
"Eventually, some customer VMs on these particular host servers became unavailable and unable to restart on alternate physical hosts."
Ninefold managing director Peter James told iTnews today that the fault was in the NFS server hardware "which ultimately transmitted into a software failure".
James could not say how this occurred. "We're still working with engineers to determine the absolute underlying events," he said.
He said patches had been applied to the NFS storage system and that further upgrades were in the pipeline.
Provisioning taken offline
Ninefold was careful in describing the outage, stating it was "not... system wide".
The outage was confined to part of the virtual server side of Ninefold's business: "some" virtual server instances stored on "a number" of physical host servers connected to the problematic NFS server were affected. The outage did not touch a "large" number of virtual servers or physical hosts, and had no impact on Ninefold's cloud storage business.
However, the outage also meant that no customers - across the board - could spin up new virtual servers over the five-hour period.
James confirmed that provisioning for all customers was disabled so that engineers could restore the VMs that had failed.
"We took a decision based on advice from our engineers," James said.
"The issue is that if we've got customers, albeit a small number of them, [that] are down, they're the ones that we absolutely focus on.
"We took a decision to take the appropriate steps to get them back up and running, albeit that on a Thursday it did mean that a number of customers couldn't provision but it did mean that we were able to reasonably quickly get that small number of customers who were more affected back up and running."
The post-incident report does not mention the number of physical host servers or customer VMs impacted and James declined to elaborate.
New availability zone
In response to the outage, James said Ninefold has doubled its investment in an already-planned core infrastructure enhancement project to $1 million. Installation work is expected "over the next few weeks".
The provider also plans to launch a second availability zone hosted in Macquarie Telecom's forthcoming Intellicentre 2 facility in North Ryde.
The zone is expected to be live in May, a month ahead of earlier expectations.
James said Ninefold would make a "significant investment as the anchor client" in the data centre.
He said Ninefold was "yet to work the detail out" on whether it might run its presence in Macquarie's two data centres in an active-active configuration.
"We're already underway with our planning, but we're yet to make that call," James said.
He said the launch of the zone was "largely aimed at increasing the resilience" of the service as it scaled up to meet customer growth.
Ninefold previously suffered a host server outage in August last year, and an earlier incident in May 2011.