Many users complained of an outage on Christmas Eve, which stopped them watching their TV shows and movies via Netflix. Amazon have blamed ‘human error’ for the server downtime.
Amazon have said that a developer mistakenly deleted part of the ‘ELB state data’, which the Elastic Load Balancing (ELB) service uses to balance traffic across multiple servers. When the issue happened, it took the company several hours to work out exactly what was going wrong.
Amazon said: “The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted. This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer). The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers.”
Initial efforts to restore a snapshot of the system configuration taken before the accidental deletion, a process which took several hours, did not work. A second method worked better; however, it took some time to implement correctly.
Amazon’s AWS team had to merge the new ELB state data with the old, a process which took three hours alone. They then had to spend five more hours gradually re-enabling all of the service workflows and APIs in a way that didn’t cause problems for the correctly running processes. Amazon said the system was operating normally by 12:05 PM PST the following day.
The company said: “Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service”.
Amazon have since implemented new policies intended to ensure this cannot happen again. The ELB state data is now harder to delete without specific approval. The team say: “We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event.”
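To give a rough idea of what an approval requirement for deletions might look like in practice, here is a minimal sketch in Python. It is not Amazon's tooling: the function `delete_state`, the `ApprovalRequired` exception, the `approved_by` parameter and the environment names are all hypothetical, chosen only to illustrate the idea of refusing a destructive action against production data unless someone has signed it off.

```python
class ApprovalRequired(Exception):
    """Raised when a destructive action is attempted without sign-off."""


def delete_state(store, key, environment, approved_by=None):
    """Delete `key` from `store`, requiring approval in production.

    Hypothetical illustration only; names and behaviour are assumptions.
    """
    if environment == "production" and not approved_by:
        raise ApprovalRequired(
            f"deleting {key!r} in production needs explicit approval"
        )
    # Return the removed value so the caller can log it or restore it later.
    return store.pop(key)
```

With a guard like this, a maintenance script pointed at production by mistake would stop with an error instead of silently removing data, while the same call with an approval reference attached would go through.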
KitGuru says: A lesson learned.