
Sorting Through the Wreckage of Last Week’s Outages

Major outages left some huge companies, like United Airlines and the New York Stock Exchange, scrambling to get their businesses back online.

Every now and then, when you step back and consider how reliant our society has become on online systems, it can really blow your mind. When you do it because those systems seem to be crashing all around you, it can be downright terrifying.

Such was the case last Wednesday, when United Airlines grounded its flights around the world due to a software glitch, the New York Stock Exchange suspended trading for four hours due to problems with its internal systems, and the Wall Street Journal homepage experienced significant problems, with localized outages around the country.

Now, whether or not you retreated to your bunker on Wednesday and started making plans to repopulate the Earth, Colbert did make an important point: our reliance on technology has made our lives and businesses far more efficient, but the more we rely on those systems, the more their fragility gets passed on to us.

To make matters worse, companies are usually hesitant to invest the time and money necessary to completely overhaul their software systems. This typically leads to technical debt: software updates get built on top of old code (particularly at companies that have been using it for a long time, like airlines) rather than new, advanced systems being built from scratch. The end result is a product that gets the job done most of the time, but has significant holes.

That’s the problem pointed out by Zeynep Tufekci, who says that while people were panicking last week about cyber-terrorism possibly playing a role in the outages, they were ignoring the much greater risk of relying on outdated and flawed software systems.

Of course, this vulnerability only increases the need for proactive monitoring in order to catch the problems that crop up in these systems before they cause widespread outages and slowness. By running continuous synthetic tests on your software and infrastructure, you can not only catch the major problems like those suffered by United and the NYSE, but also the smaller “micro-outages” that the WSJ experienced.
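To make the idea of a synthetic test a little more concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not a description of any particular monitoring product: the names `probe`, `classify`, and `failing_probes`, the 2-second slowness threshold, and the rule that any 4xx/5xx response counts as "down" are all hypothetical choices made for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical threshold: responses slower than this are flagged as "slow".
SLOW_THRESHOLD_S = 2.0


@dataclass
class ProbeResult:
    url: str
    status: int       # HTTP status code, or 0 if the request failed outright
    latency_s: float  # wall-clock time spent on the request
    verdict: str      # "ok", "slow", or "down"


def http_fetch(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 on a network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, etc.


def classify(status: int, latency_s: float,
             slow_threshold_s: float = SLOW_THRESHOLD_S) -> str:
    """Turn one measurement into a verdict (assumed rules, tune as needed)."""
    if status == 0 or status >= 400:
        return "down"
    if latency_s > slow_threshold_s:
        return "slow"
    return "ok"


def probe(url: str,
          fetch: Callable[[str], int] = http_fetch,
          clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Run one synthetic check: fetch the URL and time how long it takes."""
    start = clock()
    status = fetch(url)
    latency = clock() - start
    return ProbeResult(url, status, latency, classify(status, latency))


def failing_probes(urls: List[str],
                   fetch: Callable[[str], int] = http_fetch) -> List[ProbeResult]:
    """Probe every URL and return only the results that are not "ok"."""
    results = [probe(url, fetch) for url in urls]
    return [r for r in results if r.verdict != "ok"]
```

Calling `failing_probes` on a list of your own endpoints every minute or so (from cron or a simple loop) is the "continuous" part. Because the fetcher is passed in as a parameter, the same logic can be exercised with a fake fetcher that never touches the network. A real setup would also want probes from multiple locations, since localized outages are invisible from a single vantage point.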

In the meantime, 2015 continues to be the year of the outage. In addition to the three major ones last week, we’ve seen other problems at United and rival airlines like American due to third-party issues, tech giants like Facebook and Apple (specifically, iTunes) going down for hours at a time, and even Starbucks being forced to close thousands of locations around the country for a night in April due to a problem with its POS systems.

And short of a complete change of mindset on the part of the entire tech industry, it’s probably not going to get better anytime soon. The best we can do is try to stay on top of it as much as possible.
