Feature - The Top Internet Outages of 2024
Ahead of their appearance at the upcoming DTX Manchester exhibition - taking place from 2-3 April 2025 - Cisco ThousandEyes, a network intelligence company, explores some of 2024’s most notable Internet outages and application issues, along with key takeaways to help ITOps teams improve digital resilience in 2025.
In 2025, digital resilience is a top priority for IT Operations teams around the globe. When outages happen, it’s how you identify and recover from them that makes the big difference for users and businesses. Beyond that, consistent proactive optimisation is essential to both elevate digital experiences for users and guard against potential problems before they impact customers.
The biggest outages of 2024 offer plenty of lessons for ITOps teams charged with improving digital resilience in their business, with recurring themes emerging - most notably the number of outages that were the consequence of configuration changes or automation-related failures.
Here, Cisco ThousandEyes goes through some of the most notable outages and disruptions of 2024, identifying key takeaways to help businesses assure great digital experiences for their users in 2025.
Microsoft Teams Service Disruption | 26 January 2024
Microsoft Teams was disrupted for more than seven hours in January, when a problem inside Microsoft’s own network affected the collaboration service.
Frozen apps, login errors, and users left hanging in meeting waiting rooms were some of the symptoms reported during the disruption, which began early in the workday for many Americans.
ThousandEyes’ own observations during the incident indicated that the failure was consistent with issues in Microsoft’s own network. Failover didn’t appear to relieve the issue for many users, although further “network and backend service optimisation efforts” did eventually restore service.
Meta Outage | 5 March 2024
On 5 March, Meta experienced an outage that prevented users from accessing services including Facebook, Instagram, Messenger, and Threads. While the platform appeared to be reachable, many users were unable to proceed beyond the login or authentication process.
Shortly after the outage began, Meta confirmed that it was experiencing problems with its login services. The issue was likely caused by a failure in one of the dependencies that the login system relies on. ThousandEyes observations also pointed to a backend cause, as Meta’s systems appeared reachable and the network paths connecting to its services showed no significant problems that could have led to the outage.
This outage serves as a reminder that issues with just one part of the application delivery chain can render the whole service functionally unusable. It’s crucial to have full visibility into your whole digital delivery chain to help identify any drops in performance or functionality.
Atlassian Confluence Disruption | 26 March 2024
In late March, workspace application Atlassian Confluence experienced issues, resulting in customers having problems accessing the service and receiving HTTP 502 bad gateway errors.
While this was a relatively short outage, lasting just over an hour, ThousandEyes’ analysis revealed it affected users all over the globe. By tracing the network paths to the application’s frontend web servers, hosted in AWS, it was clear that this was a backend issue rather than a problem with network connectivity itself.
This is one of those outages where relying on error messages would only give you half the story. Identifying the root cause requires you to consider factors such as any third-party dependencies. Being able to rule out issues with a cloud hosting provider, for instance, gets you one step closer to identifying the real problem.
Google.com Outage | 1 May 2024
In early May, Google.com experienced a global disruption lasting around an hour, during which users encountered HTTP 502 error messages instead of the expected search results.
The HTTP 502 (Bad Gateway) status code typically indicates that a proxy or gateway server has failed to get a valid response from the origin server behind it. It can also be a sign of overwhelming levels of traffic, but there was no reason to suspect that Google was suddenly struggling under demand, with no extraordinary events to trigger such an influx of search traffic.
ThousandEyes analysis revealed a 'lights on/lights off' scenario, where service suddenly dropped, suggesting a problem with backend name resolution or something connected to policy/security verification, rather than an issue with the search engine itself.
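To make that 502 behaviour concrete, here is a minimal sketch of a reverse proxy: when it cannot get a response from its origin server, the only thing it can usefully return to the client is a 502 Bad Gateway - the same symptom users saw, even though the real failure sits further back in the chain. The addresses below are placeholders, and this illustrates the meaning of the status code rather than anything in Google's own stack.

```python
# Minimal illustration (not production code): a reverse proxy that returns
# HTTP 502 when it cannot reach its origin. ORIGIN is a placeholder; any
# unreachable backend will trigger the 502 path.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request, error

ORIGIN = "http://127.0.0.1:9000"  # hypothetical origin/backend address

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # Forward the request to the origin server.
            with request.urlopen(ORIGIN + self.path, timeout=3) as resp:
                body = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
        except error.HTTPError as exc:
            # Origin answered with an error status: pass it through unchanged.
            body = exc.read()
            self.send_response(exc.code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            # Origin unreachable or unresponsive: all the proxy can do is 502.
            body = b"502 Bad Gateway: upstream did not respond"
            self.send_response(502)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), ProxyHandler).serve_forever()
```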
CrowdStrike Sensor Update Incident | 19 July 2024
Organisations in Australia and New Zealand began experiencing issues mid-afternoon on Friday 19 July. A range of industries and major brands simultaneously reported outages as their Windows machines reportedly became stuck in a boot loop that ultimately resulted in the BSOD (Blue Screen of Death). The impact quickly spread to other geographies, causing problems with airline booking systems, grocery stores, and hospital services. And these were just the tip of the iceberg.
Initial responsibility for the widespread outage was thought to lie with Microsoft, but a different common denominator emerged: CrowdStrike, an endpoint security vendor whose software is used to protect Windows endpoints from attack.
CrowdStrike published guidance on actions and workarounds for IT administrators, along with an early technical post-incident report that attributed the incident to an issue with a single configuration file that “triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems.” Recovery wasn’t a simple task, requiring IT staff to physically attend to machines to get them functional. At one point, Microsoft reported that up to 15 reboots per machine might be needed.
Cloudflare Disruption | 16 September 2024
Cloudflare is one of the world’s biggest CDN providers, so when it catches a cold, other well-known services start sneezing.
Cloudflare’s 16 September outage lasted for around two hours, and affected applications such as Zoom and HubSpot. The ThousandEyes platform showed the impact on these third-party applications clearly, with agents in the US, Canada, and India all failing to connect to the various applications during the outage.
This is a good example of how you can avert the “Is it just me?” problem. By tracking the entire service delivery process of your applications, you can follow the network paths taken by your apps - and the suppliers they are connected to.
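A lightweight way to start answering that question is to probe your own application alongside the third-party endpoints it depends on, so a shared-provider failure stands out straight away. The sketch below does this from a single vantage point using hypothetical URLs; a production setup would run the same checks from agents in multiple regions, as ThousandEyes did here.

```python
# Rough sketch: probe an application and its third-party dependencies so a
# shared-provider outage (e.g. a CDN) is visible at a glance. All URLs are
# placeholders, and a single vantage point only answers part of the question.
import urllib.request

ENDPOINTS = {
    "our-app":       "https://app.example.com/healthz",   # hypothetical
    "cdn-edge":      "https://cdn.example.net/ping",      # hypothetical CDN dependency
    "auth-provider": "https://login.example.org/status",  # hypothetical third party
}

def probe(url: str, timeout: float = 5.0) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK (HTTP {resp.status})"
    except OSError as exc:
        return f"FAILED ({exc})"

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(f"{name:15} {probe(url)}")
```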
Microsoft Outage | 25 November 2024
Microsoft’s late November outage, which affected services such as Outlook Online, occurred in two parts and wasn’t always easy to spot.
Problems emerged around 2 AM (UTC), with symptoms such as timeouts, resolution failures, and the occasional HTTP 503 error message. The problems were intermittent and not always obvious to end users, with the service sometimes presenting as slow or laggy.
The issue appeared to be resolved within an hour or so, but four hours later problems emerged again, this time with greater severity. ThousandEyes observed increased packet loss at the edge of Microsoft’s network and growing congestion on connections to its services.
Microsoft later explained the problem was caused by a configuration change that caused an “influx of retry requests routed through servers.” The outage was resolved by performing “manual restarts on a subset of machines that [were] in an unhealthy state.”
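A common client-side guard against this kind of retry storm is capped exponential backoff with jitter, so failed requests don’t all pile straight back onto an already struggling service. The sketch below shows the general pattern - it is not Microsoft’s actual remediation, and request_fn is a placeholder for any call that can fail transiently.

```python
# General pattern (not Microsoft's fix): capped exponential backoff with
# jitter keeps retries from amplifying a backend problem into a flood of
# simultaneous retry requests.
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """request_fn is a placeholder for any operation that may fail transiently."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff, capped, with full jitter to spread retries out.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```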
OpenAI Outage | 11 December 2024
We almost made it through an entire year of outages without mentioning AI. OpenAI’s December outage affected ChatGPT and the new generative video service, Sora. Users witnessed partial page loads, with requests for further information prompting HTTP 403 error messages.
ThousandEyes observations pointed to backend application issues, and that was later confirmed by OpenAI, which revealed that a new telemetry service deployment had “unintentionally overwhelmed the Kubernetes control plane,” causing cascading failures.
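A typical safeguard against exactly this failure mode is to rate-limit how hard any one service can hit a shared dependency such as a control plane. The token-bucket sketch below illustrates the principle only - send_telemetry is a placeholder, and this is not a description of OpenAI’s fix.

```python
# Illustrative only: a token-bucket limiter that caps how fast a service can
# call a shared dependency (such as an API server or control plane).
# send_telemetry() is a placeholder, not a real API.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def send_telemetry(payload: dict) -> None:
    ...  # placeholder for the actual reporting call

bucket = TokenBucket(rate_per_sec=5, burst=10)  # at most ~5 calls/sec sustained

def report(payload: dict) -> None:
    bucket.acquire()          # throttle before touching the shared service
    send_telemetry(payload)
```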
Key takeaways from 2024
You’ll notice that most of the major outages of 2024 stemmed either from a backend configuration change with unintended consequences or from the failure of an automated system.
ITOps teams have limited control over faulty configuration changes made by service providers. However, they can enhance their overall visibility into service delivery paths, which allows them to quickly identify the source of any errors when they occur.
This approach provides valuable insights into faults or degraded components, enabling teams to take appropriate actions, such as rolling back changes, redirecting to alternative resources, or implementing contingency plans. By thoroughly understanding their service delivery chains, teams can also regularly optimise services to improve digital experiences and enhance digital resilience.
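In practice, that visibility often starts with decomposing a single user-facing check into its underlying steps, so a failure can be attributed to DNS, network connectivity, TLS, or the application itself. The sketch below shows that decomposition for a hypothetical hostname; dedicated monitoring platforms do the same thing with far more depth and from many more vantage points.

```python
# Minimal sketch of localising a failure along the delivery chain by testing
# each layer in turn: DNS -> TCP -> TLS -> HTTP. The hostname is a placeholder.
import socket
import ssl
import urllib.request

HOST = "app.example.com"  # hypothetical service

def check(host: str) -> None:
    try:
        addr = socket.gethostbyname(host)                # 1. DNS resolution
    except OSError as exc:
        print(f"DNS failed: {exc}")
        return
    try:
        sock = socket.create_connection((addr, 443), timeout=5)  # 2. TCP connect
    except OSError as exc:
        print(f"TCP connect to {addr} failed: {exc}")
        return
    try:
        ctx = ssl.create_default_context()
        tls = ctx.wrap_socket(sock, server_hostname=host)         # 3. TLS handshake
        tls.close()
    except OSError as exc:
        sock.close()
        print(f"TLS handshake failed: {exc}")
        return
    try:
        with urllib.request.urlopen(f"https://{host}/", timeout=5) as resp:  # 4. HTTP
            print(f"HTTP {resp.status} - all layers OK")
    except OSError as exc:
        print(f"HTTP request failed: {exc}")

if __name__ == "__main__":
    check(HOST)
```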
As we have observed in several significant outages of 2024, error messages typically provide only a hint about what has happened; they cannot in isolation identify the cause.
If 2024’s major outages deliver one lesson, it’s that your digital resilience depends on knowing what’s gone wrong - or what could potentially go wrong - even before the service providers themselves acknowledge an issue.
- Cisco ThousandEyes will be exhibiting at the upcoming DTX Manchester event, taking place on 2-3 April 2025. To register, and for more information about the event, click here.
For more news from the DTX exhibitions, click here.
Simon Rowley - 5 February 2025