2024 SRE Report insights: The critical role of third-party monitoring in SRE

Published April 2, 2024

The 2024 SRE Report highlights a pivotal shift in how organizations approach the reliability and monitoring of their services, especially those that extend beyond their direct control. According to the report, 64% of organizations now recognize the importance of monitoring endpoints whose failure disrupts productivity or user experience, even when those endpoints lie beyond their physical control.

“When I think about control, I think about my time on the production operations team at Ask.com, formerly Ask Jeeves,” says Leo Vasiliou, Author of the SRE Report. “We had a relationship with Google, and the Google ad call was one of our most monitored endpoints. Why? Because when there was a problem with it (that is, when there was a problem with one of our revenue streams), our CEO was in our office(s), not Google’s, asking when it was going to be fixed.”

This shift is a marked departure from traditional monitoring paradigms. It acknowledges that the complexity of the modern web and the dependence on third-party services necessitate a more comprehensive approach to ensuring resilience.

It’s unanimous – reliability is a team sport  

A key highlight of the SRE Report is the analysis of responses segmented by the respondents’ organizational ranks, an insightful feature introduced in the 2023 SRE Report. Previous findings uncovered misalignment between practitioners and management in several key areas, including the value received from AIOps, how much of a problem tool sprawl is, and the perception of blamelessness. However, the consensus across all organizational levels on the necessity to monitor external endpoints underlines a rare alignment in priorities.  

Individual contributors and business leaders alike agreed that modern reliability practices must include third-party services. In the report's data, agreement with this statement rises as organizational rank increases.

This trend is significant because it signals a shared priority that transcends hierarchical boundaries. It suggests that third-party monitoring is seen not just as a technical issue but also as a strategic necessity. That recognition is critical for securing the resources and executive support needed to implement the tools and strategies that provide visibility into third-party services.

The danger of relying on third-party services  

Integrating third-party software into websites offers both advantages and challenges. eCommerce websites and platforms often depend on multiple external apps and tools to enhance their capabilities, such as personalizing customer experiences, offering live chat services, and analyzing how changes affect user interactions. Although these third-party solutions are crucial for adding interactive elements, they can also cause disruptions to user experience.  

Given that any single point of failure in the Internet Stack can break a system, incorporating third-party services within the scope of reliability engineering is crucial. The report states, “The need to drive reliability and resilience will increasingly necessitate the inclusion of third-party vendors in monitoring strategies. We see this area of reliability work as an opportunity to both build better relationships beyond the ‘four walls’ of the company and to improve learning about those lesser-monitored areas of the Internet Stack, such as BGP and SASE.”
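To make the single-point-of-failure argument concrete, here is a minimal Python sketch of how availability compounds when a user request needs every dependency to be up. It is an illustration under stated assumptions, not something taken from the report: the dependency names and availability figures are hypothetical, and the calculation assumes failures are independent, which real outages often are not.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every dependency to be up.

    Assumes failures are independent, which is a simplification.
    """
    return prod(availabilities)


# Hypothetical figures for illustration only; none of these come from the report.
dependencies = {
    "own_application": 0.9995,
    "third_party_cdn": 0.999,
    "third_party_payments": 0.999,
    "third_party_chat_widget": 0.995,
}

combined = serial_availability(dependencies.values())
weakest = min(dependencies.values())

print(f"Weakest single dependency: {weakest:.4%}")
print(f"Combined availability:     {combined:.4%}")
```

Under these assumptions, the combined figure is lower than that of the weakest individual dependency, which is one way to see why a service you do not control still belongs in your reliability picture.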

The opportunity in loss of control

In their ‘View from the field’ commentary, Sarah Butt and Alex Elman propose that the loss of control over certain aspects of the digital ecosystem presents a unique opportunity rather than a setback. By decentralizing architectures and sharing responsibilities, “teams can focus on core competencies while relying on the expertise of others in non-core areas. This dynamic is often necessary and helpful since it introduces new ways of working in complex sociotechnical systems.”  

The pair also address the “multi-party dilemma,” which encapsulates the challenges faced at the intersections of interdependent organizations, especially during incidents. To navigate this dynamic effectively, they suggest activities to increase reciprocity, establish shared frames of reference, and reduce asymmetry. Read their suggestions in the report.  

Coming soon: A new vision for monitoring third parties

Most Application Performance Monitoring (APM) platforms available today offer only limited, if any, insight into the Internet Stack. This lack of visibility is why many organizations face difficulty delivering resilient digital experiences. IT teams are essentially left flying blind, lacking a comprehensive view of the third-party services their applications rely on.  

Imagine having a unified view of your Internet Stack, where every dependency, even one beyond your control, is mapped out on a single dashboard. Coming soon to the Catchpoint platform is a first-of-its-kind capability that automatically discovers all the essential components, both internal and third-party, of an organization’s critical services or applications. This groundbreaking tool will tackle the multi-party dilemma faced by SREs, providing the proactive visibility required for resilient digital experiences. Stay tuned for updates.

Read the 2024 SRE Report today (no registration required)

