19 December 2014 by Bill Boyle - DatacenterDynamics
Microsoft has given a final - and painful - explanation for a major outage of Azure cloud services on November 18, 2014. The downtime left some customers of Azure Storage and other offerings without internet services, and was caused by human error.
In the immediate aftermath, Jason Zander of the Microsoft Azure Team said on a blog that the problem was an automated update, intended to improve performance, which sent storage blob front-ends into an infinite loop. His post apologized and presented a preliminary Root Cause Analysis (RCA), with a promise to address the issue.
In the full explanation, Microsoft says that a typical deployment first puts updates through an internal test environment, then deploys them to pre-production environments running production-level workloads and validates them there. Updates are then deployed to a small slice of the production infrastructure and ‘flighted’ across incrementally larger slices, under a process designed to ensure that any missed problems impact only one region from a set of paired regions (e.g. West Europe and North Europe).
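The staged rollout Microsoft describes can be sketched as follows. This is a minimal illustration in Python, not Microsoft's actual tooling: the scope names, the `validate` health-check callback and the region-pair handling are all invented for the example.

```python
# Hypothetical sketch of the staged "flighting" rollout described in the
# report. Scope names and the validate() callback are illustrative.

def flight(update, region_pairs, validate):
    """Deploy `update` through progressively wider scopes.

    Within production, one region of each pair is updated and validated
    before its partner, so a problem missed by earlier checks impacts
    only half of a paired set (e.g. West Europe / North Europe).
    Returns True if every health check passed, False if the rollout
    was halted.
    """
    # Pre-production scopes: internal test, then production-level
    # workloads in pre-production, then a small production slice.
    for scope in ("internal-test", "pre-production", "production-slice"):
        if not validate(update, scope):
            return False  # halt the rollout on any failed check

    # Broad production rollout, one half of each region pair at a time.
    for first, second in region_pairs:
        if not validate(update, first):
            return False
        if not validate(update, second):
            return False
    return True
```

The point of the sketch is that the broad rollout is only reachable after every earlier scope has validated cleanly, which is exactly the step the report says was skipped.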
Zander's initial explanation was: “During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting [initial tests - Ed]. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.”
Microsoft’s explanation for the outage now makes clear that its own procedures were not followed.
“We developed a software change to improve Azure Storage performance by reducing CPU footprint of the Azure Storage Table Front-Ends," says the final report. "We deployed the software change using the described flighting approach with the new code disabled by default using a configuration switch. We subsequently enabled the code for Azure Table storage Front-Ends using the configuration switch within the Test and Pre-Production environments. After successfully passing health checks, we enabled the change for a subset of the production environment and tested for several weeks."
The company then got over-confident. The fix improved performance and resolved some known customer issues with Azure Table storage performance, so the team decided to deploy the fix broadly in the production environment.
"During this deployment," says Zander, "there were two operational errors".
Firstly, the standard policy of flighting deployment changes across small slices was not followed. "The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk," says the report.
Secondly, although the change had been validated in test and pre-production against Azure Table storage Front-Ends, it was incorrectly enabled for Azure Blob storage. When it was applied to Azure Blob storage front-ends a bug was exposed, which started those infinite loops.
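The second error amounts to a feature flag being flipped for a front-end type on which the new code path had never been exercised. A minimal sketch, with invented flag names and a toy in-memory store rather than Azure's real configuration system:

```python
# Illustrative feature-flag scoping; flag and front-end names are invented.
# The reported error is, in effect, enabling a switch for Blob front-ends
# when it had only been validated against Table front-ends.

flags = {}  # (feature, frontend_type) -> enabled


def enable(feature, frontend_type):
    """Turn a feature on for one front-end type only."""
    flags[(feature, frontend_type)] = True


def is_enabled(feature, frontend_type):
    return flags.get((feature, frontend_type), False)


# The validated path: the CPU-reduction change on Table front-ends.
enable("cpu-reduction", "table")

# The mistake, in effect: also enabling it for Blob front-ends, where
# the untested code path contained the infinite-loop bug.
enable("cpu-reduction", "blob")
```

Scoping flags per front-end type makes the blast radius of a switch explicit; the failure here was that the broader scope was selected by hand rather than blocked by tooling.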
Automated monitoring alerts notified Microsoft's engineering team within minutes, and the change was rolled back within half an hour, which protected some users. But front-ends already caught in the infinite loop were not responding and could not pick up the reverted configuration: "These required a restart after reverting the configuration change, extending the time to recover."
Guidelines not used
In short, Microsoft Azure had clear operating guidelines but was relying on humans to follow them. It promises that tooling updates will make sure the policy is enforced by the deployment platform itself.
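An enforcement check of the kind Microsoft describes might look like the following. This is a hypothetical sketch: the exception type, scope names and bookkeeping are assumptions, but the idea matches the report, in that the platform rather than the engineer decides whether a broad rollout is permitted.

```python
# Hypothetical policy gate of the kind Microsoft says its deployment
# tooling will enforce; names and scopes are invented for illustration.

class PolicyViolation(Exception):
    """Raised when a deployment request skips required flighting scopes."""


REQUIRED_SCOPES = ["internal-test", "pre-production", "production-slice"]


def request_deployment(change, scope, flighted_scopes):
    """Approve a deployment only if the change has already passed
    through every required flighting scope.

    `flighted_scopes` maps a change name to the scopes it has completed.
    """
    if scope == "production-wide":
        done = flighted_scopes.get(change, [])
        missing = [s for s in REQUIRED_SCOPES if s not in done]
        if missing:
            # The platform, not the engineer, blocks the shortcut.
            raise PolicyViolation(f"{change} skipped scopes: {missing}")
    return f"deploying {change} to {scope}"
```

With such a gate, the engineer's judgement that a broad rollout was "low risk" could not have bypassed the incremental-flighting policy.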
In addition, Microsoft customers faced three other problems: disk mount timeouts, VMs failing during setup and network programming errors.
Microsoft also admitted it did not communicate well during the incident, and promised to address the following problems:
1) Delays and errors in status information posted to the Service Health Dashboard.
2) Insufficient channels of communication (Tweets, Blogs, Forums).
3) Slow response from Microsoft Support.
Microsoft's timeline of the incident was as follows:

· 11/19 00:50 AM – Detected Multi-Region Storage Service Interruption Event
· 11/19 00:51 AM – 05:50 AM – Primary Multi-Region Storage Impact. Vast majority of customers would have experienced impact and recovery during this timeframe
· 11/19 05:51 AM – 11:00 AM – Storage impact isolated to a small subset of customers
· 11/19 10:50 AM – Storage impact completely resolved, identified continued impact to small subset of Virtual Machines resulting from Primary Storage Service Interruption Event
· 11/19 11:00 AM – Azure Engineering ran continued platform automation to detect and repair any remaining impacted Virtual Machines
· 11/21 11:00 AM – Automated recovery completed across the Azure environment. The Azure Team remained available to any customers with follow-up questions or requests for assistance