The ATO says it will rebuild its internal IT infrastructure capability after two outages of its outsourced storage environment exposed poor system design and maintenance.
The tax office today revealed that in response to the outages of its HPE 3PAR storage area network (SAN) in December 2016 and February 2017, it would “enhance [its] IT capability pertaining to infrastructure design and implementation planning (particularly relating to resiliency and availability)”.
“This should be done having regard to recruitment, engagement of contractors, and whole‑of‑government strategies,” the ATO said in a systems report handed down today.
The ATO said “planning [is] in progress” to make this happen.
Most of the high-level technical causes and system design issues for the December outage were revealed a week ago, including improperly fitted cables, inactive monitoring tools, and a SAN design that promoted performance over stability and resilience.
It was revealed today that the second major meltdown in February this year was the result of human error as HPE technicians tried to replace SAN cabling.
“Unfortunately, during one replacement exercise, we were informed that data cards attached to the SAN were dislodged,” the ATO said.
“This caused the 3PAR SAN to act in a similar way to that noted during the December outage. This included unsuccessful steps to automatically remediate, followed by a system shut‑down to preserve data integrity.”
However, it appeared the bigger issues were in the construct of the outsourced arrangement itself, and in particular the way HPE made system design decisions and kept tabs on the SAN’s performance.
The SAN infrastructure consisted of primary and secondary SANs in geographically diverse locations in Sydney, with “regular” data replication between the two.
The ATO said its IT staff had “no direct access” to the SANs.
One of the report’s damaging revelations is that the SAN configuration had been experiencing issues for six months prior to the first meltdown, and that the ATO had been kept somewhat in the dark over the severity.
“Analysis of SAN log data for the six months preceding the [December 2016] incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage,” the ATO said.
“Specifically since May 2016, at least 77 events related to components that were observed to fail in the December 2016 incident were logged in our incident resolution tool.
“In addition at least 159 alerts were recorded in SAN device monitoring and management logs (SNMP logs).”
HPE took action by replacing some cables connecting the SAN, but the ATO said the alerts continued.
“We were not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN,” the ATO claimed.
The December outage also exposed a number of design decisions taken by HPE engineers.
“The SAN was neither designed nor built to cater for greater than single drive failure or single cage failure,” the ATO said.
“The SAN build [also] included ‘daisy‑chain’ cage configuration which exacerbated the risk of errors spreading across cages as occurred during the incident.”
The ATO said there was no evidence HPE had evaluated other configuration options during setup.
As the December outage began, more system design flaws were exposed, which the ATO said led to the outage being more prolonged and severe.
“This particular SAN configuration leverages a feature known as wide-striping which is designed to significantly improve performance by reading and writing blocks of data to and from multiple drives at the same time, preventing single-drive performance bottlenecks,” the ATO said.
“When several physical disk drives were impacted by a drive firmware issue which prevented those drives from re-booting, the result was that a small number of drives temporarily and in some cases permanently prevented access to a significant amount of application data.
“This also had the effect of extending the duration and complexity of the recovery effort.”
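To illustrate why wide-striping can widen the impact of a small number of drive failures, the following is a minimal Python sketch. It is a simplified model, not HPE's 3PAR implementation: the function names, the round-robin layout, and the volume and drive counts are all hypothetical, and it deliberately ignores RAID redundancy and replication.

```python
def stripe_volumes(num_volumes, blocks_per_volume, num_drives):
    """Place each volume's blocks round-robin across the drives.

    Returns a dict mapping volume id -> set of drive ids holding its blocks.
    Hypothetical layout for illustration only; no redundancy is modelled.
    """
    layout = {}
    next_drive = 0
    for vol in range(num_volumes):
        drives = set()
        for _ in range(blocks_per_volume):
            drives.add(next_drive)
            next_drive = (next_drive + 1) % num_drives
        layout[vol] = drives
    return layout


def affected_volumes(layout, failed_drives):
    """Return the volumes that have at least one block on a failed drive."""
    failed = set(failed_drives)
    return [vol for vol, drives in layout.items() if drives & failed]


# Wide striping: each volume has 40 blocks spread over all 20 drives.
wide = stripe_volumes(num_volumes=10, blocks_per_volume=40, num_drives=20)

# Narrow placement: each volume has 2 blocks, so it touches only 2 drives.
narrow = stripe_volumes(num_volumes=10, blocks_per_volume=2, num_drives=20)

failed = [3, 4]  # two drives that fail to come back up
print(len(affected_volumes(wide, failed)))    # every volume is touched
print(len(affected_volumes(narrow, failed)))  # only the volumes on those drives
```

In this toy model, losing two of 20 drives touches every wide-striped volume but only two of the narrowly placed ones, which is the trade-off between performance and failure isolation that the ATO describes.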
In HPE’s defence, the ATO said that “this particular combination of events has not been previously experienced in relation to HPE 3PAR SANs”.
The ATO cautioned that it still cannot be certain of the “root cause” of the issues that downed the SAN.
“Root cause examination cannot be completed until the SAN is physically removed and taken back for forensic testing. This process may not be completed until late 2017,” it said.
The ATO said it will decommission the current SAN in July, allowing HPE to ship it elsewhere for forensic analysis. The SAN will be replaced with a newer 3PAR SAN with better data replication, failover, backup and monitoring than the prior system.
The agency said its experiences would “inform our future IT acquisitions”.
“As contracts come up for renewal, we need to balance service, stability, resilience and cost,” it said.