
July 19th global IT outage reminds us of digital complexity

There have been two major Internet outages in the last 24 hours, both reminders that the Internet is neither magically infallible nor inherently resilient.

As we write, on Friday July 19th, a massive global IT outage continues to take down critical services around the world that depend on Microsoft-based computers.

In what appears to be one of the biggest outages ever, daily life is being affected around the world, from the micro scale (in the UK, for instance, local doctors are seeing only very ill patients and writing up their notes by hand) to the macro: major airlines grounded, emergency services taken offline, and major banks and enterprises unable to do business.

Cybersecurity firm CrowdStrike has taken responsibility for the issue, blaming a faulty automatic software update that has knocked affected Microsoft PCs and servers offline, forcing them into a recovery boot loop that prevents the machines from starting properly.

“The scale of today's global IT outage is unparalleled in recent history. It serves as a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient,” said Mehdi Daoudi, CEO and co-founder of Catchpoint. “It is also a reminder you need to manage and control change: Don't blindly update software or change configuration.

At any moment, even the smallest oversight or piece of unpreparedness can bring systems—and consequently businesses—down.

Preparation and visibility are key, not just to prevent such outages but to mitigate the vast financial risks they pose. The fallout from today’s event will likely be measured not just in the disruption of services but in exponential financial losses worldwide, potentially amounting to millions or even billions in lost revenue. It highlights a critical vulnerability: our increasing dependency on digital infrastructure can translate into staggering costs when that infrastructure fails. Kudos to all the IT professionals and teams who are working tirelessly to resolve this issue and restore services.”  

A second major outage within the last 24 hours

While everyone is looking at the CrowdStrike incident (whose exact scale, impact and ramifications are harder to detect from outside, since it is caused by faulty software rather than a service), Catchpoint caught a separate significant outage within the last 24 hours, which very likely hit some companies twice.

The overlap has caused widespread confusion in the media, as various news sites have linked the two issues when, in fact, they are independent of one another.

Any Internet-based service that relied on Azure's Central US region during the incident on July 18th, and did not have a multi-region or multi-cloud strategy in place, would have been impacted. That includes knock-on dependencies, such as APIs used by eCommerce sites, which affected site functionality. Let’s take a deeper look.

Catchpoint’s Internet Sonar detects initial set of issues impacting Azure

On Thursday July 18th, 2024, Catchpoint’s Internet Sonar detected an outage in Azure services that disrupted critical services across the Central US region. The outage lasted from 18:37 to 22:17 EDT and led to numerous sites returning HTTP 503 responses, particularly those using Azure Functions. Catchpoint data rapidly isolated the issue and confirmed it was not network-related, saving network teams time and resources on unnecessary triage and network troubleshooting.

Internet Sonar shows Azure Services outages impacting critical services across the Central US region (Internet Sonar/Catchpoint)
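
To illustrate the distinction drawn above between a network problem and a service problem, here is a minimal Python sketch. It is not how Internet Sonar works internally; the function names `classify_probe` and `summarize` and the sample data are hypothetical. The idea is simply that a probe which receives an HTTP 503 has reached the server, so the network path is working and the fault lies with the service, whereas a probe that never receives an HTTP response points to DNS, TCP or TLS trouble.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_probe(status: Optional[int], transport_error: Optional[str] = None) -> str:
    """Classify one probe result (hypothetical helper, for illustration only).

    status: HTTP status code, or None if no HTTP response was received.
    transport_error: description of a DNS/TCP/TLS failure, if any.
    """
    if transport_error is not None or status is None:
        return "network"    # no HTTP response at all: look at the network path
    if 500 <= status <= 599:
        return "service"    # server was reached but failed, e.g. HTTP 503
    return "reachable"      # server answered without a server-side error


def summarize(probes: Iterable[Tuple[Optional[int], Optional[str]]]) -> Counter:
    """Count probe results by category across many vantage points."""
    return Counter(classify_probe(status, err) for status, err in probes)


# Made-up sample data: three probes got an HTTP response, one timed out.
sample = [(503, None), (503, None), (200, None), (None, "connection timed out")]
print(summarize(sample))
```

If most probes land in the "service" bucket, as in this made-up sample, the network team can reasonably stand down while the application or cloud provider is investigated.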

Major impact on Microsoft services

During this period, Microsoft 365 services were also impacted. Users encountered difficulties accessing a range of business-critical services, including SharePoint Online, OneDrive, Teams and other Microsoft services. 

During the outage, assets stored on OneDrive were significantly affected, with users experiencing HTTP 503 responses when attempting to access these files.   

Microsoft Teams also faced disruptions during the outage. Users encountered HTTP 503 responses when accessing Teams in the browser.

Impact to eCommerce providers

We also observed failures on API requests for some major eCommerce providers, which caused issues when users attempted to add products to their carts.

Major outages reveal a complex digital world

In The Internet Resilience Report 2024, when we asked how critical third-party platform providers were to digital or Internet Resilience success, only 1% of respondents said there was no criticality at all. 77%, meanwhile, said third-party providers were extremely or highly critical to their Internet Resilience success.

The two major IT outages within the last 24 hours demonstrate once again how interdependent we are in today’s highly complex digital world. There are so many different operating systems in use, so many services. You never know when someone might bring you down, and you need to be ready for when they do. As these outages show, multiple things can fail, and the ramifications can be enormous when they do.  

3 key takeaways

#1 - Prepare for failure

It’s crucial to prepare ahead of time. The faster an outage is detected, the faster remediation efforts can begin to minimize the impact to the bottom line. Our customers tell us repeatedly that one of the primary reasons they work with Catchpoint is for proactive detection of outages and service degradations, which we are often able to highlight ahead of the vendor’s own announcements.

#2 - Know your dependencies and monitor them

Chart your dependencies: you can use Catchpoint’s Internet Stack Map to do exactly that. CNN's homepage alone, for example, relies on over 600 dependencies to load. As this latest outage proves at a massive scale, the Internet is not infallible. From security software to cloud services, we are clearly hugely reliant on third parties.

One of the ways for your sysadmins and operations teams to get the sleep they deserve – and to mitigate the impact of incidents proactively – is to remove any monitoring gaps. Achieve Internet resilience by monitoring the output and performance of every component - external and internal.  
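
As a rough illustration of monitoring a dependency inventory, the Python sketch below checks a named list of endpoints and reports which ones are responding. It is a toy, not a replacement for a monitoring platform: the helper names `http_status` and `check_dependencies`, the dependency names and the example.com URLs are all hypothetical. The fetch function is injectable so the logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code      # server answered with a 4xx/5xx status
    except OSError:
        return 0             # DNS failure, refused connection, timeout, etc.


def check_dependencies(
    deps: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each dependency name to True if it answered with a 2xx/3xx status."""
    return {name: 200 <= fetch(url) < 400 for name, url in deps.items()}


# Hypothetical inventory, checked here with canned responses instead of real requests.
inventory = {
    "cdn": "https://cdn.example.com/health",
    "cart-api": "https://api.example.com/health",
}
canned = {
    "https://cdn.example.com/health": 200,
    "https://api.example.com/health": 503,
}
print(check_dependencies(inventory, fetch=lambda url: canned[url]))
```

In real use you would drop the `fetch` argument so the default `http_status` makes live requests, run the check on a schedule from more than one location, and alert on any dependency that reports False.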

#3 - Trust and verify changes

These outages are a reminder you need to manage and control change. Perhaps the biggest takeaway of all: don't blindly update software or change configuration. Control software changes and always test before you globally deploy.
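
One common way to "test before you globally deploy" is a staged rollout: push a change to a small canary group first and widen it only while health checks keep passing. The Python sketch below is a simplified illustration under our own assumptions, not a description of any vendor's deployment process. The function `staged_rollout`, its parameters and the host names are hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy ring by ring and stop as soon as a ring looks unhealthy.

    Returns the hosts that received the update, in deployment order.
    """
    updated: List[str] = []
    for ring in rings:
        for host in ring:
            deploy(host)
            updated.append(host)
        # Verify the ring before touching the next, larger one.
        failures = sum(1 for host in ring if not healthy(host))
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt the rollout: later rings never receive the change
    return updated


# Hypothetical rings: one canary host, then progressively larger groups.
rings = [["canary-1"], ["eu-1", "eu-2"], ["us-1", "us-2", "us-3"]]

# Simulate a bad update that leaves every updated host unhealthy.
touched = staged_rollout(rings, deploy=lambda host: None, healthy=lambda host: False)
print(touched)
```

With a faulty update, only the canary ring is affected and the remaining hosts keep running the previous version, which is the whole point of controlling change rather than updating everything at once.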

Ultimately, developing failover strategies for all your important services – across the spectrum, from security services to web performance – is essential in today’s complex interdependent digital world.  
