
Managing an Outage: Catchpoint and Zendesk Case Study

Published May 25, 2016

We talk a lot about analyzing and responding to outages. The truth is, much of the advice we give comes from watching how our clients deal with performance issues. We recently got a chance to manage an issue of our own, so we wanted to document what we saw and what we did. We think of it as a chance to eat our own dog food and to show off what we learned (and even what we’ve got left to learn) from the masters of customer experience management.

Like our clients, Catchpoint depends on third-party services to deliver the customer experience we aim for. We have strong partners and partnerships, but we all know that systems go down sometimes. We rely on Zendesk to support our help portal, and on the morning of April 22nd our system alerted us to an issue that would clearly affect our customers:

[Chart: Dogfood_Chart_705b]

As this chart shows, our own synthetic testing against support.catchpoint.com detected connection, timeout, and 404 errors. At least some of our users were likely to find the help portal slow or unreachable.
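For readers curious what a check like this looks like in its simplest form, here is a minimal Python sketch. It is not how Catchpoint's platform is implemented. The function names `probe` and `classify`, the five-second timeout, and the category labels are all assumptions made for illustration. The sketch requests a URL and sorts any failure into the same three buckets mentioned above: connection errors, timeouts, and HTTP errors such as a 404.

```python
import socket
import urllib.error
import urllib.request


def classify(exc):
    """Map an exception raised during a probe to a coarse error category."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"
    # URLError wraps the underlying cause in .reason (an exception or a string).
    if isinstance(exc, urllib.error.URLError):
        exc = exc.reason
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    # Anything else (DNS failure, refused or reset connection) counts as
    # a connection error in this sketch.
    return "connection"


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch url once and return "ok" or an error category.

    `opener` is injectable (a hypothetical convenience) so the
    classification can be exercised without touching the network.
    """
    try:
        # urlopen raises HTTPError for 4xx/5xx responses by default,
        # so reaching the body of the with-block means success.
        with opener(url, timeout=timeout):
            return "ok"
    except OSError as exc:  # URLError, HTTPError, and timeouts are OSErrors
        return classify(exc)


if __name__ == "__main__":
    print(probe("https://support.catchpoint.com"))
```

A real synthetic monitor runs checks like this on a schedule from many locations and alerts on patterns rather than on a single failed request. The sketch only shows the classification step.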

Almost immediately, Zendesk confirmed what we saw and that we weren’t the only client affected. This is a good example of a proactive response to an issue. Virtually every Zendesk client knew right away that the company was experiencing a technical issue, was actively working on it, and had traced it to a specific data center. We wrote recently about app rage, and how little it helps to badger a trusted partner to fix an issue that you know they’re dealing with. Zendesk communicated openly about the problem. We didn’t need to become another problem.

[Image: Dogfood_Zendesk_705b]

Instead we followed their lead and made sure that our own customers knew we were aware of the problem. Within ten minutes of detecting the outage we provided a clear update at status.catchpoint.com with what we knew about the issue, and offered an alternative way to reach our support team. Our customers didn’t need to call us or search for information because we found and addressed the issue so early.

Our team acted fast to get ahead of the situation, but that’s not really why our response was so quick. We have a playbook for service disruptions and our team followed it. Most of our clients have far more complex contingencies for detecting, diagnosing, and responding to outages, but our commitment to plan ahead and prepare for failures is based on best practices they developed.

The entire incident lasted only a few minutes. Zendesk communicated proactively, Catchpoint kept customers informed, and we continued to manage support requests without interruption. Solid preparation by our partner and our team, together with early detection through synthetic monitoring, minimized the impact of the outage. Ultimately our solution did its job and protected our customer experience, so at least on this day our dog food tasted pretty good.
