Google’s terrible, horrible, no good, very bad fortnight


When Gmail and Google Drive browned out last week, it drew a lot of attention to a big Google SNAFU. But what went largely unremarked was that Google has actually had a horror fortnight, with errors aplenty across multiple services.

The not-very-much-fun kicked off on March 5th with an incident that caused virtual machine connectivity issues across the northern hemisphere.

In the following days the company experienced brief problems with a Kubernetes service, cloud routers, and the Dialogflow conversational interface service.

Google Dataflow had a multi-hour wobble, App Engine had a bad day on March 12th, and Cloud Console had a worse one on March 14th.

Then came the cloud storage outage, all four hours of it.

Google has published root cause analyses for a few of the outages, and they reveal that the incidents were mostly the company’s own fault.

The cloud storage outage was caused by “a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually led to a cascading failure.”
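To make that mechanism concrete, here is a minimal, purely hypothetical Python sketch of how such a cascading failure builds: once offered load exceeds what a lookup layer can serve, failed requests come back as retries and push the load even higher. The capacity, load and retry figures are invented for illustration and are not drawn from Google's systems.

# Toy simulation of a cascading failure: a config change adds load to a
# lookup service, timed-out requests are retried, and the retries push
# demand further past capacity. All numbers here are hypothetical.
CAPACITY = 1000          # requests/sec the lookup service can handle
BASE_LOAD = 900          # normal steady demand
CONFIG_OVERHEAD = 200    # extra lookups introduced by the config change
RETRIES_PER_FAILURE = 2  # each failed request is retried, adding load

load = BASE_LOAD + CONFIG_OVERHEAD
for second in range(5):
    served = min(load, CAPACITY)
    failed = load - served
    print(f"t={second}s  offered={load}  served={served}  failed={failed}")
    # Failed requests return as retries on top of steady demand,
    # so offered load keeps climbing: the feedback loop of a cascade.
    load = BASE_LOAD + CONFIG_OVERHEAD + failed * RETRIES_PER_FAILURE

Run it and the "failed" column grows each second, which is the feedback loop the incident report describes in one sentence.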

The Cloud Console crash was blamed on a quota bug: “a code change in the most recent release of the quota system introduced a bug, causing a fallback to significantly smaller, default quota limits, resulting in user requests being denied.”
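As a hedged illustration only, the sketch below shows how a quota check that silently falls back to a small built-in default can start rejecting traffic that was well within its configured limit. The project name, quota values and functions are all hypothetical, not Google's implementation.

# Hypothetical quota check: a buggy release ignores the configured limit
# and falls back to a tiny default, so normal traffic gets denied.
DEFAULT_QUOTA = 100                      # conservative built-in default
configured_quotas = {"project-a": 50_000}

def effective_quota(project: str, buggy_release: bool) -> int:
    if buggy_release:
        # The bug: the configured value is never read, so every project
        # drops to the small default limit.
        return DEFAULT_QUOTA
    return configured_quotas.get(project, DEFAULT_QUOTA)

def handle_request(project: str, current_usage: int, buggy_release: bool) -> str:
    limit = effective_quota(project, buggy_release)
    return "OK" if current_usage < limit else "429 quota exceeded"

# Normal release: usage is well under the 50,000 limit, so it's accepted.
print(handle_request("project-a", 4_000, buggy_release=False))  # OK
# Buggy release: the same usage now exceeds the 100-request default.
print(handle_request("project-a", 4_000, buggy_release=True))   # 429 quota exceeded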

The App Engine problem looks to have been caused by the storage incident.

Google publishes verbose incident reports, unlike some of its rivals, and is brave to do so given its market share trails AWS and Microsoft. But perhaps a fortnight of failures laid out this frankly helps explain why buyers don’t rate it as highly as those rivals.
