IBM has finally figured out why customers spent weeks struggling with VSI provisioning and reloads for its Information Management System.
As CRN reported, problems first appeared on 1 April when IBM noticed VSI provisioning and reloads were being processed slower than usual. Systems engineers traced the issue back to an abnormally large transaction that overwhelmed IBM's available resources, blocking subsequent transactions and causing them to fail.
IBM eventually cleared the deadlock by restarting the affected systems, allowing transactions to proceed. However, the problem recurred eight more times over the next three days, requiring a system restart each time.
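The failure mode IBM describes is easier to picture with a toy model. The Python sketch below is purely illustrative and assumes nothing about IBM's actual code: a fixed pool of transaction slots is exhausted by a single oversized transaction, so later transactions time out and fail until the pool is reset, much like the restarts described above.

import threading
import time

# Toy model (not IBM's system): a fixed pool of transaction "slots".
POOL_SIZE = 4
slots = threading.BoundedSemaphore(POOL_SIZE)

def oversized_transaction(hold_seconds=10):
    # Grabs every slot and holds them, starving everything queued behind it.
    for _ in range(POOL_SIZE):
        slots.acquire()
    time.sleep(hold_seconds)
    for _ in range(POOL_SIZE):
        slots.release()

def normal_transaction(timeout=2):
    # Fails if it cannot get a slot within the timeout.
    if not slots.acquire(timeout=timeout):
        return "failed"  # blocked behind the oversized transaction
    try:
        return "ok"
    finally:
        slots.release()

threading.Thread(target=oversized_transaction, daemon=True).start()
time.sleep(0.1)
print(normal_transaction())  # prints "failed" while the big transaction holds the pool

In a setup like this, the quickest way to unblock the queue is to reset the pool, the software equivalent of the restarts IBM kept performing.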
IBM has now revealed what went wrong.
The outage was caused by three independent issues, the first of which occurred eight months before any problems showed up.
In August 2018, while implementing a performance-improvement configuration, IBM inadvertently disabled the service that detects and auto-corrects specific errors. Two months later, during a scheduled update, engineers bypassed the customised configurations needed to operate IBM's database.
A week before the first outage, IBM compounded the problem by adding more resources to the database.
These issues alone didn't cause any strife, but when the large workload hit internal systems on 1 April, all three flared up at once.
IBM has since improved database monitoring to include detection of these issues. Big Blue has also scheduled an update for 11 May to correct the software defect and prevent the necessary configurations from being bypassed.
"We sincerely apologise for any inconvenience that this incident may have caused," IBM said in its explanation of the problems.