Cisco has revealed why its WebEx Teams cloud collaboration service went down: it deleted its own virtual machines.
No, it’s not 1 April and we are not making this up.
WebEx Teams went down on 26 September Australian time. It struggled back to life over several days.
Cisco sent us a statement about the outage that said: “The service interruption was the result of an automated script running on our Webex Teams platform which deleted the virtual machines hosting the service.”
The statement continued: “This was a process issue, not a technical issue.”
“We continue to investigate the causes of the script being run, however we are confident that this is an isolated incident and processes are in place to prevent any recurrence,” the statement added.
CRN understands that WebEx runs in AWS, but Amazon’s cloud is blameless. The fault here is all Cisco’s.
Clearly this is a significant failure of processes at Cisco: it simply should not be possible for a script to delete assets of such importance.
But Cisco is not alone in breaking itself with bad scripts: in late August 2018, IBM’s cloud was unable to provision new instances for around eight hours. Big Blue’s incident report said the cause was “a change … made to one of the scripts that creates the device host names. This change inadvertently resulted in the device host names not resolving and subsequently the authentication issues.”
Cloud services are promoted as more resilient than anything most organisations could build by themselves, and cloud operators promote themselves as possessing the operational smarts that make that resilience possible. Clearly those claims need to be tested.