Don’t get caught in the dark: Lessons from a Lumen & AWS micro-outage

Published

August 15, 2024

Dritan Suljoti

Eknath Reddy

Denton Chikura

While major outages like the recent CrowdStrike incident dominate headlines, those of us in the trenches ensuring Internet Resilience know that most of our issues are not necessarily global but localized by geography, autonomous systems, or something else.

Micro-outages – those elusive, localized incidents – can pose the most persistent threat to observability. Though smaller in scale, they can be just as disruptive, affecting specific regions, services, and users without ever making it to the news or a status page.

All of which brings us to Amazon Web Services (AWS).

This week, AWS experienced a micro-outage that went undetected by many. It did not appear on AWS’s status page, but given AWS’s share of the global cloud market, even a micro-outage can have far-reaching consequences.

Catchpoint Internet Sonar (part of our Internet Performance Monitoring platform) detected both the disruption and its impact on our clients.

Let’s dive into the details.

What happened?

On Wednesday, August 14th, between 8:00 and 8:25 UTC, Catchpoint’s Internet Sonar detected connection timeouts impacting multiple AWS services, including S3, EC2, CloudFront, and Lambda, across multiple regions from locations on CenturyLink AS209 and Lumen (Level 3) AS3356.

While the event did not cause a 100% outage, it significantly impaired the ability of these locations to reach AWS services.

*Internet Sonar dashboard showing the impact of AWS’s micro-outage, with connection issues detected across different AWS regions, for multiple services.*

Further investigation reveals that the connection timeouts primarily occurred when traffic originated from or passed through Level 3 AS3356 and CenturyLink AS209, suggesting potential peering issues.

It remains unclear what the root cause of the peering issue was. However, we do know that, from these two Lumen ASes, only AWS was affected.
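To make the reasoning concrete, here is a minimal Python sketch of how timeouts can be correlated with the network a test originated from. It is an illustration only: the `ProbeResult` record, the function names, and the sample data are hypothetical and do not represent Internet Sonar's data model or measurements from the incident.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One synthetic test result (hypothetical record shape)."""
    origin_asn: int   # AS the probe's traffic originated from
    target: str       # service tested, e.g. "s3"
    timed_out: bool   # True if the connection timed out


def timeout_rate_by_asn(results):
    """Return {asn: fraction of probes from that AS that timed out}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        totals[r.origin_asn] += 1
        if r.timed_out:
            failures[r.origin_asn] += 1
    return {asn: failures[asn] / totals[asn] for asn in totals}


def suspect_asns(results, threshold=0.5):
    """Return the sorted ASNs whose timeout rate is at or above threshold."""
    rates = timeout_rate_by_asn(results)
    return sorted(asn for asn, rate in rates.items() if rate >= threshold)


# Made-up sample data for illustration, NOT measurements from the incident.
SAMPLE = [
    ProbeResult(3356, "s3", True),
    ProbeResult(3356, "ec2", True),
    ProbeResult(3356, "cloudfront", False),
    ProbeResult(209, "s3", True),
    ProbeResult(209, "lambda", True),
    ProbeResult(64512, "s3", False),   # 64512 is a private-use ASN used as a placeholder
    ProbeResult(64512, "ec2", False),
]
```

With this sample, `suspect_asns(SAMPLE)` returns `[209, 3356]`: failures cluster on two origin networks while the placeholder network is clean, which is the kind of pattern that points toward a path or peering problem rather than a failure of the service itself.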

Key lessons

AWS’s micro-outage offers critical insights into the complexities of cloud infrastructure. Here are the key lessons to take away from this incident.

#1 - The Internet is fragile

This incident is a powerful reminder that the stack of a digital service is more than just the code. It includes the cloud platform you run on, the network, BGP peering, DNS, and much more – what we call the Internet Stack. It’s an intricate collection of technologies, systems, and services that power every digital user experience.

Focusing on just one layer of the Internet Stack creates a narrow, opaque view that can ultimately harm your end users and your business.

Outages can stem from any layer, whether it’s the application itself, the network, or routing issues like BGP. Each layer introduces potential vulnerabilities that, when combined, can lead to significant disruptions.

#2 - Status pages aren’t infallible

Status pages aren’t always reliable indicators of service health. During the incident, AWS’s official status page didn’t reflect the issues affecting its services.

Of course, every cloud provider has its own thresholds and processes for determining whether an outage warrants a status page update. They’re not keeping you in the dark on purpose. In many cases, providers might not even be aware of an outage or slowdown until customers start complaining.

As anyone in IT knows, there are times when we believed an issue had no impact on users, only to find out later that it did. This can lead to delays in acknowledging issues, or in some cases, no acknowledgment at all.

The concern grows when it comes to performance degradations and slowdowns. These issues rarely make it to a status page, yet they can severely affect user experience. Given the high cost of Internet disruptions, even a brief delay in addressing these issues can be extraordinarily expensive. And if you’re waiting for your cloud provider to tell you when something’s wrong, that delay could be even longer.

#3 - Be wary of relying on an observability platform hosted on the cloud

Relying solely on a cloud-hosted observability platform introduces significant risk.

If the underlying cloud provider experiences an outage, your ability to monitor and manage your systems is compromised, potentially leading to false negatives and missed critical issues. This leaves your organization blind to potential problems, increasing vulnerability to unexpected failures.

Popular platforms like AppDynamics, Dynatrace, New Relic, Splunk, and Datadog are all hosted on AWS, making them susceptible to AWS outages. Their synthetic solutions also primarily test from AWS locations, hence you are monitoring AWS from AWS. What could go wrong?

To mitigate this risk and reduce the chance of false negatives, it's essential to diversify your observability strategy and avoid single points of failure.
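As a rough illustration of what an independent check can look like, the sketch below times a plain TCP connection using only Python's standard library. It is a minimal example, not a substitute for a monitoring platform: the function name `check_tcp` is ours, and the assumption is that you run it from a vantage point outside the cloud provider you are monitoring.

```python
import socket
import time


def check_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and report the outcome.

    Returns a tuple (ok, elapsed_seconds, error_name). error_name is None
    on success, otherwise the exception class name (for example
    "TimeoutError" or "ConnectionRefusedError").
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and applies the timeout
        # to the connection attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:
        # OSError covers timeouts, refused connections, and DNS failures.
        return False, time.monotonic() - start, type(exc).__name__


# Example call (hypothetical endpoint, replace with your own service):
# ok, elapsed, error = check_tcp("your-service.example.com", 443)
```

Running the same check from several networks and comparing the results is what separates "the service is down" from "the service is unreachable from here."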

#4 - Always have a fallback plan – or communicate clearly

Outages, whether large or small, are inevitable in any cloud environment. That’s why it’s essential to have a fallback plan for when your vendor’s services fail. Of course, not everyone can afford a multi-cloud or hybrid infrastructure stack, so if a fallback isn’t feasible, the next best thing is to ensure clear and proactive communication with your users.

Transparency about the issue and the steps being taken to resolve it can help maintain trust, even during disruptions.

By preparing for the unexpected and communicating effectively, you can mitigate the impact of outages and keep users informed and engaged.

Independent proactive monitoring of your Internet Stack is essential

This incident highlights the crucial need for independent monitoring of your cloud services. You can’t afford to rely on someone else to alert you when there’s a problem with your cloud services.

Your users won’t blame the cloud provider – they’ll blame you.

If your clients are businesses and you have SLAs in place, an outage like this could lead to a breach. Just 25 minutes of downtime amounts to roughly a 0.06% hit to availability over a 30-day month. And if the provider hasn’t acknowledged the issue, your users are left with a poor user experience and no explanation. Even if an explanation comes later, the damage to your reputation, and potentially your revenue, is already done.
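The arithmetic behind that SLA figure is easy to check. The sketch below assumes a 30-day measurement window (the post does not state one) and uses a hypothetical 99.95% monthly target for comparison. The function names are ours, for illustration only.

```python
def downtime_percent(downtime_minutes, period_days=30):
    """Downtime as a percentage of the measurement period."""
    period_minutes = period_days * 24 * 60
    return 100.0 * downtime_minutes / period_minutes


def allowed_downtime_minutes(sla_percent, period_days=30):
    """Minutes of downtime an availability target permits in the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100.0)


def breaches_sla(downtime_minutes, sla_percent, period_days=30):
    """True if the downtime exceeds what the target allows."""
    return downtime_minutes > allowed_downtime_minutes(sla_percent, period_days)


# 25 minutes is the duration reported for the incident.
# 99.95% is a hypothetical target chosen for illustration.
hit = downtime_percent(25)
budget = allowed_downtime_minutes(99.95)
breached = breaches_sla(25, 99.95)
```

Under these assumptions, 25 minutes is about 0.058% of the month, and it exceeds the 21.6 minutes that a 99.95% target would allow.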

This is where Catchpoint IPM comes in.

We’ve built our platform from the ground up to deliver deep and wide visibility into the Internet Stack, enabling you to find and fix disruptions before your business is affected. Our cloud-native platform ensures Internet Resilience across your organization with the following industry-leading features and capabilities:

Unparalleled worldwide and regional visibility through our Global Observability Network with over 2,700 nodes from more than 360 providers in 101 countries – with more being added all the time.
Proactive incident management, so IT teams can identify issues across public and private networks and application layers, find the root cause, and triage fast.
AI-powered tools, including:
- Internet Sonar so you can answer the question, “Is it me or something else?” quickly.
- Internet Stack Map for instant awareness of critical service or application issues.

The AWS outage is a reminder of the fragility and complexity of the Internet. Whether it’s micro-outages that fly under the radar or headline-grabbing disruptions, depending on cloud provider status pages is a gamble you can’t afford to take. Without IPM tools like Internet Sonar, you’re operating in the dark, leaving your users, your reputation, and your money at risk.

Learn more about preventing outages from our guide, or test drive Catchpoint for yourself in our guided product tour.
