Amazon has explained the cause behind this week’s hours-long AWS outage that disrupted thousands of websites and apps worldwide.
The company said a bug in its automation software triggered the massive failure, which affected services ranging from banking platforms to smart beds.
In a detailed report released on Thursday, AWS outlined how a series of cascading events led to the downtime. The outage prevented customers from connecting to DynamoDB, Amazon’s database system that stores data for AWS clients.
According to AWS, the issue stemmed from “a latent defect within the service’s automated DNS [domain name system] management system.”
DynamoDB manages hundreds of thousands of DNS records. Automation keeps these records up to date so the service can handle hardware failures, add capacity when needed, and distribute traffic efficiently. The incident began when an empty DNS record appeared for the Virginia-based US-East-1 datacentre region. The automation system failed to repair the record, forcing AWS engineers to intervene manually.
To prevent similar incidents, AWS said it has disabled the DynamoDB DNS planner and DNS enactor automation globally. The company is now working to fix the root cause and introduce additional safeguards.
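AWS has not said what form the additional safeguards will take. As a purely illustrative sketch, one common guard in DNS automation is to refuse any update that would leave an endpoint with no addresses. The function names, the dictionary layout and the example hostname below are hypothetical and are not taken from AWS's systems.

```python
from __future__ import annotations


class EmptyRecordError(ValueError):
    """Raised when a proposed update would publish an empty DNS record."""


def find_empty_records(records: dict[str, list[str]]) -> list[str]:
    """Return the endpoint names whose address list is empty."""
    return [name for name, addresses in records.items() if not addresses]


def apply_update(live: dict[str, list[str]],
                 proposed: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the records that should be live after the update.

    Hypothetical guard: reject the whole update if any endpoint would end
    up with no addresses, so that `live` stays in place unchanged.
    """
    empty = find_empty_records(proposed)
    if empty:
        raise EmptyRecordError(f"refusing update, empty records for: {empty}")
    # Copy the proposed records so later changes to `proposed` do not leak in.
    return {name: list(addresses) for name, addresses in proposed.items()}
```

In this sketch a caller would catch `EmptyRecordError`, keep serving the previous records and alert an operator, rather than publishing a record that resolves to nothing.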
The outage had a ripple effect on other AWS tools and services. Platforms such as Signal, Snapchat, Roblox, Duolingo, and several banking sites were hit, as was Ring, the smart doorbell company. According to Downdetector, 8.1 million users worldwide reported issues across more than 2,000 companies.
Although services were restored within hours, the impact was significant.
Customers of Eight Sleep, a smart bed company, were unable to adjust bed temperature or incline through their app during the outage. Matteo Franceschetti, the company’s chief executive, apologised to users on X and announced an update that will allow bed functions to be controlled via Bluetooth during future outages.