Welcome to Episode 10
@Gangrif and @Xenophage make a great pair that will titillate one's ears! They cover things in the ops and
infosec news categories and topics that are relatable or at least interesting to discuss. It's not your typical
podcast format, but that is what makes it refreshing.
Keep up the great content guys!
Patreon, you guys are awesome
Youtube stream for this episode! https://youtu.be/EeD5y34oKNY
Trouble in the cloud: the 2/28/2017 us-east-1 S3 outage
An Amazon employee was troubleshooting a problem with their S3 billing mechanisms.
A mistake made while following an established playbook took down systems that were not intended to be taken down.
The downtime, which was intended only for billing systems, took down subsystems essential to both reads and writes through the S3 API.
This required that some systems be rebooted.
The index and placement subsystems (two of the systems accidentally taken down) had not been fully restarted in years.
Due to the dependencies between these systems, the restarts took quite some time.
The downtime caused a backlog of requests, which needed to be processed once the systems were operational again.
The core issues here were the number of systems unintentionally taken offline, and the fact that systems that depended on each other were taken down at the same time.
Amazon has made changes to their tools to help prevent capacity from dropping below service-affecting thresholds.
They are also working to remove some of the inter-dependencies.
On top of everything, the S3 status page itself depended on the health of the S3 service in order to operate.
This made it difficult for customers to view the status of S3.
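The capacity guardrail Amazon described can be sketched roughly like this: before removing servers, check the request against a floor and a per-operation cap. All names and numbers below are illustrative assumptions, not Amazon's actual tooling.

```python
# Hypothetical capacity-removal guardrail: refuse any removal that
# would drop a fleet below its service-affecting minimum, and cap
# how much can be removed in a single operation.

MIN_HEALTHY = 8             # assumed floor below which the service degrades
MAX_REMOVAL_FRACTION = 0.1  # assumed cap: at most 10% of the fleet at once

def removable(current_healthy: int, requested: int) -> int:
    """Return how many servers may safely be removed (possibly 0)."""
    # Rate cap: never remove more than a fraction of the fleet at once,
    # but always allow at least one so small fleets can still be serviced.
    cap = max(1, int(current_healthy * MAX_REMOVAL_FRACTION))
    # Floor: never dip below the minimum healthy count.
    floor_limit = max(0, current_healthy - MIN_HEALTHY)
    return min(requested, cap, floor_limit)

print(removable(100, 50))  # -> 10: capped at 10% of the fleet
print(removable(9, 5))     # -> 1: only one server above the floor
print(removable(8, 1))     # -> 0: already at the minimum, refuse
```

An operator asking to remove 50 of 100 servers gets 10; a fleet already at its floor gets nothing, which is the behavior that would have blunted this outage.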
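The restart-ordering problem above (dependent systems coming back slowly) is essentially a topological sort: bring each service up only after everything it depends on is running. A minimal sketch, using hypothetical service names rather than Amazon's actual subsystems:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map: each service lists what it depends on.
# Names are illustrative only.
deps = {
    "index": set(),                      # no dependencies
    "placement": {"index"},              # needs the index online first
    "s3-api": {"index", "placement"},    # needs both before serving requests
}

# static_order() yields a safe restart order: dependencies first.
order = list(TopologicalSorter(deps).static_order())
print(order)  # -> ['index', 'placement', 's3-api']
```

The downside, as the outage showed, is that a deep dependency chain makes recovery strictly serial, which is part of why Amazon said it is working to reduce these inter-dependencies.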
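The status-page lesson is to probe a service from infrastructure that does not depend on it. A minimal out-of-band health check might look like this; the URL and timeout are placeholder assumptions:

```python
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0) -> str:
    """Return 'up' or 'down' based on a simple HTTP check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "up" if resp.status == 200 else "down"
    except (urllib.error.URLError, OSError):
        return "down"

# The result would be published to a status page hosted elsewhere,
# e.g. a static site on a different provider, so customers can still
# see it when the monitored service is down:
# status = probe("https://example.com/health")
```

The point is the hosting split, not the probe itself: if the status page lives on the service it reports on, it goes dark exactly when it is needed.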
Intro and Outro music credit: Tri Tachyon, Digital MK 2