Welcome to the seventh edition of The SRE Report.
Catchpoint is proud to continue dedicating this independent research to advancing reliability and resilience practices, in the hope of simply making them better.
The SRE Report is built upon insights from our annual SRE survey. It delves into recurring themes such as time spent and whether varying levels of managerial responsibilities influence perceptions about reliability and resilience. This year, we also explore new research such as production pressures and whether [digital] performance has come of age. True to form, we haven’t set out to provide prescriptive advice but rather to present the data as objectively as possible, empowering you to derive insights most relevant to your unique challenges.
Site Reliability Engineering (SRE) has the power to transform how organizations manage and ensure system reliability. Embracing SRE is more than a technical enhancement; it’s a strategic journey toward future success. However, the journey towards organizational transformation and success must commence with introspective individual observations. Only by understanding that our work is the foundation for organizational outcomes can we align on the significant opportunities ahead and the path for achieving them.
Whether this report data confirms existing beliefs or sparks new ways of thinking, we sincerely hope you enjoy reading it.
CEO, Catchpoint
It’s no longer just about whether services are up or down—it’s about whether they perform.
We open this year’s report with an investigation into the validity of an increasingly popular catchphrase—something we often hear bandied about in corridors: “Slow is the new down.”
The phrase “Slow is the new down” means that poor performance is as bad as complete downtime or unavailability. It illustrates an evolution in performance mindset, highlighting its importance as a critical dimension, beyond just uptime. Therefore, we set out to determine whether this expression holds true in practice. The majority (53%) of organizations agree with this expression even though only 21% say they have heard it before.
What makes this piece of the research so interesting is how frequently respondents selected answer choices that had nothing to do with the question asked. For example, note the second most popular statement, ‘Performance should be tracked against a service level objective’ (44%).
Given the previous data on tracking performance indicators against service level objectives, we decided to place the results from this priority question here. For this report, site reliability engineering as the top priority (41%) should come as no surprise. The choice of service or experience level objectives (SLOs or XLOs) as the second highest priority (40%) only reinforces the need to track performance indicators against objectives.
Tracking these indicators against objectives also includes the concept of budgets—predetermined allowances for acceptable errors or deviations in system performance. These budgets ensure that resources are allocated to uphold service standards, which in turn reduces the risk of performance degradation.
Performance and error budgets are inherently part of SLOs and XLOs. Given the emphasis on SLOs and XLOs, and the sentiment to track performance indicators against those objectives, this burndown chart is presented to illustrate how one may visualize that tracking. In other words, a burndown shows whether your indicators (the blue line) are on track to meet or breach the objectives (the red line) over time. It also shows how much budget—essentially the delta between them, in this case a performance budget—is left to be ‘spent’.
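To make the burndown arithmetic concrete, here is a minimal sketch in Python. The 99.9% objective, the request volume, and the daily error counts are illustrative assumptions, not figures from this report.

```python
# A minimal error-budget burndown sketch (illustrative numbers only).

SLO_TARGET = 0.999            # objective: 99.9% of requests succeed
WINDOW_REQUESTS = 10_000_000  # requests expected in the 30-day window

# The error budget: how many failures the objective allows.
allowed_errors = (1 - SLO_TARGET) * WINDOW_REQUESTS

def budget_remaining(errors_so_far: int) -> float:
    """Fraction of the error budget still left to 'spend' (may go negative)."""
    return 1 - errors_so_far / allowed_errors

# Cumulative daily error counts play the role of the indicator (blue) line;
# the objective (red) line is the budget paced evenly across the window.
for day, errors in enumerate([200, 900, 2_400, 5_100], start=1):
    print(f"day {day}: {budget_remaining(errors):.1%} of budget remaining")
```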
Visualizing error budgets helps stakeholders understand risk, honor SLOs, and balance agility with stability. Burndown charts can also be applied to third parties, making them a powerful ally for managing vendor relationships.
These performance indicators should not be relegated to exclusively internal or first-party perspectives. Instead, also consider external and multi-party perspectives. Managing performance should not be done in a silo. Performance practices should look holistically at the overall system.
We were pleased to see that the delta between all applicable attributes is smallest when it comes to performance optimization. Unfortunately, performance optimization also ranked lowest in frequency when compared to the other applicable attributes, instrumentation and stack maps/service maps. In this vein, we highlight the opportunity before us all to elevate the importance of continual performance optimization, because clients’ digital experience expectations will also continually rise.
For most teams, it seems the burden of operational tasks has grown.
The survey used to generate insights for this report has employed a consistent methodology for the last seven years. This ensures the data remains trustworthy and comparable, leading to provocative conversations about industry trends. When toil levels rise for the first time in five years (according to this report), it certainly calls for a closer look.
In previous years, we theorized that the drop in toil was due to widespread work-from-home policies, with an expected rise as offices reopened. However, despite ongoing efforts to bring employees back, the predicted large-scale return to the office has not materialized.
Google defines toil as "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." Hence, the need to “automate all the things” is a key SRE tenet.
It’s hard to ignore the AI-shaped elephant in the room—the major shift in how work has evolved over the last five years. The general expectation was that AI would reduce toil, not exacerbate it. The jury is still out on whether AI is adding toilsome activities in unexpected ways. However, the 2024 DORA Report suggests a more nuanced reality may be at play, stating, “AI does not steal value from respondents’ work; it expedites its realization.” Paradoxically, the free time created by expediting valuable activities may end up being filled with toilsome tasks. This is at least one hypothesis for the rise in toil levels observed in this SRE report. As ever, this will be another key area of focus in next year's report.
There are two critical takeaways on this page.
First, these time-spent questions are part of our recurring research. They are offered as industry indexes for organizations to benchmark against. For example, they may be considered against other data in this report, other research, or a given organization’s own indicators.
Second, the p25, median, and p75 values for engineering activities are uncannily identical to those in The SRE Report 2024. Yet, those values for operations activities have risen.
Last year, the median value for operations activities was 25%, but this year it has risen to 30%. For most teams, it seems the burden of operational tasks—often synonymous with toil—has grown, encroaching on time that could otherwise be invested in proactive engineering efforts. This uptick in operational load signals a potential red flag, where organizations may find themselves weighed down by routine, repetitive tasks, thereby limiting their capacity for innovation and strategic development.
The consistent survey methodology we use year after year helps us draw meaningful comparisons. We were curious if the increase in toil and ops work trended with time on-call. It turns out it doesn't. Much like last year's results, on-call rotation practices have remained fairly consistent.
And remember, time in these rotations isn't simply spent—it's invested. By effectively handling issues during on-call rotations, teams are investing in learning and development, building resilience, and proactive optimization.
The higher the production pressure to perform, the less stable an organization’s priorities appear to be.
In this section of the report, we explore the perceived stability of organizational priorities against whether there is production pressure to perform. We also discuss the use of objectives and key results (OKRs) and whether reliability practitioners feel their problems and challenges are addressed.
Stability is the predominant experience for most teams (57% with a stable sentiment). This is encouraging, as it creates an environment where teams can better allocate resources, plan ahead, and minimize the disruptions that can come from shifting priorities.
The majority of organizations believe OKRs are clearly communicated (58% agree sentiment). Communicating OKRs is crucial for aligning team efforts with business goals, fostering transparency, and driving accountability. Clear OKRs ensure everyone understands priorities and enhances reliability practices. However, striving to meet OKRs can reveal challenges such as resource constraints or misaligned objectives. Addressing these issues promptly is essential to maintain morale, optimize performance, and ensure strategic alignment.
By ensuring two-way communication between the business and reliability practitioners, strategies can adapt, resources can be properly allocated, and both agility and stability can be had. This proactive approach not only mitigates risks but also reinforces a culture of continuous improvement and resilience. After all, change is inevitable in today’s highly competitive marketplace.
The majority of organizations believe reliability challenges and problems are addressed (53% agree sentiment). This cannot happen without a supporting culture.
A culture of transparency is vital for addressing reliability challenges. When teams openly share information, issues are identified and resolved more quickly, preventing minor problems from escalating. Transparency fosters trust and collaboration, encouraging team members to communicate openly about potential risks and failures. This approach ensures that reliability practices are continuously improved, as everyone is aware of the current state and can contribute to solutions. Additionally, transparent environments promote accountability, as individuals understand the impact of their actions on overall reliability. Ultimately, this culture leads to more robust and resilient systems, enhancing overall performance and reliability.
This is where the story gets a little complicated. Despite the majority of organizations believing that OKRs are clearly communicated, that reliability challenges are being addressed, and that stability is the predominant posture for most teams, the majority still indicated feeling pressured to prioritize release schedules over reliability.
It’s a classic case of agility versus stability: businesses want updates, new features, and revenue growth, whereas practitioners prioritize reliability and resilience. More than half of organizations are frequently caught in a tug-of-war between meeting tight deadlines and maintaining reliability. The tension is evident in these results.
The high number of respondents choosing “often” or “always” (41%) is especially concerning. It underscores the challenge of balancing speed with stability in fast-paced environments.
The more frequent the production pressure to perform, the less stable an organization’s priorities appear to be. In other words, the agility versus stability war rages on.
If we accept business priorities will inevitably change, then priorities are stable until they aren’t. Evaluate whether existing capabilities can be reused to account for new or different priorities. If they cannot be reused, and new capabilities are needed, this is a wonderful opportunity for reliability practitioners to build resilience and reusability into them. That is, the adaptive capacity of capabilities themselves should also be considered when discussing priorities.
Note that the negative relationship with unstable organizational priorities was also seen, to a lesser extent, when investigated across OKR communication and whether reliability problems are addressed by the organization. In this vein, we reinforce the idea of ensuring reusable, adaptable capabilities even as organizational priorities change.
Most organizations are using multiple monitoring tools, and that’s okay.
All of us, not just SREs, are under pressure to cut costs. However, this drive to reduce spending should never come at the risk of losing value.
We’ve explored this topic in multiple SRE Reports. Though the questions may have been expressed differently, they were meant to research the same idea: Is there a tool sprawl problem? This year, we decided to ask yet another form of this question: Is the received value from your monitoring/observability tooling greater than the cost?
Cost takes many forms: hard dollars, for example, or time spent implementing and maintaining. Either way, it’s crucial to understand that tool sprawl isn’t simply about “How many tools are in the stack?” Rather, if the received value from a toolchain is net positive when compared to its cost, then there is no tool sprawl problem. In fact, organizations answered with the same directional value trend regardless of how many tools they have in use.
Most organizations use between 2 and 10 monitoring or observability tools, with little change from last year. With so many organizations relying on multiple tools, we suggest there is self-evident value in doing so. It’s unrealistic to think that everything can be effectively monitored with just one tool. Different technology stacks, such as application versus Internet stacks, require different tools.
Many tools are designed to address specific scenarios. Practitioners may need to rely on a variety of tools from single or multiple vendors for comprehensive coverage, and this is okay.
Rather than fixating on the idea that ‘there are too many tools in the stack,’ the focus should shift to evaluating if the value gained from these tools justifies their total cost of ownership (or operation)—understanding that cost can take various forms.
Observability is undeniably critical, but are we truly getting the right level from our tools? The majority of organizations believe their level of observability falls short (51% ‘less than’ sentiment). Interestingly, the 16% with a ‘more than’ sentiment suggest another side of the challenge.
Striking the right balance in observability is essential for effective system monitoring. Too little observability can leave critical issues undetected, leading to prolonged downtime and difficult troubleshooting. On the other hand, too much observability can flood teams with more noise than signal, making it hard to surface and explore meaningful insights.
The key is to implement an approach focused on capturing relevant telemetry that provides actionable insights. In this context, it’s worth noting the data we previously mentioned: 41% of organizations said their front end is instrumented, while 51% said their back end is instrumented. We mention this again because conversations about relevance and value, as compared to cost, will almost certainly include digital experiences and business outcomes. It is in that vein that we stress the importance of instrumenting both the front and back end to monitor performance and behavior, and of being able to tie those metrics to what’s important to the business.
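As one illustration of what such instrumentation can look like, the sketch below uses the OpenTelemetry Python API to attach business-relevant attributes to a back-end span and metric. The service, span, metric, and attribute names are hypothetical, and SDK/exporter wiring is omitted, so these calls are no-ops until configured.

```python
# A hedged sketch of tying back-end telemetry to business outcomes.
from opentelemetry import trace, metrics

tracer = trace.get_tracer("checkout-service")  # hypothetical service name
meter = metrics.get_meter("checkout-service")

# A business-level metric, not just a system-level one.
orders_placed = meter.create_counter(
    "orders.placed", description="Completed checkouts"
)

def place_order(cart_value_usd: float, payment_method: str) -> None:
    # The span captures back-end performance; its attributes let you
    # slice latency and errors by what the business cares about.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("cart.value_usd", cart_value_usd)
        span.set_attribute("payment.method", payment_method)
        # ... order-processing logic would go here ...
        orders_placed.add(1, {"payment.method": payment_method})

place_order(129.99, "card")
```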
Consolidating tools might seem like a solution to manage complexity, but this data shows that fewer tools trend with less perceived observability instrumentation. If observability truly is a critical business function, not just a “nice to have”, then be leery of consolidating or minifying tools only for the sake of, well, consolidating or minifying tools. One pane of glass isn’t enough.
Organizations using ‘more’ tools trended with much better outcomes in terms of feeling they have the “correct amount” of observability. But remember, it’s not simply a count of the tools in your profit and loss statement. It all comes down to received value. If the value outweighs the cost, then by all means, go and get the tool for the thing you need. A best-of-breed approach allows you to cover different stacks, whether application or Internet, each requiring specialized visibility.
We close this section with a breakdown of perceived value for observability components: logs, metrics, events, and traces. The good news is the value for all listed components has ‘high’ as the majority vote. Also keep in mind the question was asked, ‘How much value do these provide to you?’ This contrasts with, for example, if we had asked, ‘How much value do these provide to your organization?’
Combining insights from logs, metrics, and traces allows for a more holistic view of the system, enabling better decision-making and more effective problem resolution. So we felt it important enough to present this breakdown and discuss it as a ‘greater than the sum of their parts’ mindset. In the context of the ‘cost versus value’ research, we suggest this will be part of the conversation when purchasing (or building) the correct tooling.
The desire for technical training on artificial intelligence (AI) is universal.
Technical training should be a key investment area for any organization to increase velocity and modernize implementations.
After all, new technologies like AI aren't going to implement themselves (not yet anyway). Employees need to be supported with the right training to meet evolving business objectives, but as the results show, support for technical training is not universal. One in five organizations is not making this investment, and the fact that 8% of respondents indicated “I do not know” regarding training support is concerning, pointing to a potential gap in communication or awareness about available resources. Training isn’t just about keeping up; it’s about staying ahead, ensuring the workforce is equipped to fully leverage new tools and frameworks, from traditional tech skills to newer AI capabilities.
In a digital-first, Internet-centric landscape, we are not surprised that online training was the most popular option (55%) for valuable learning sources. We are also pleased that onsite, in-person training also ranked high on the list (45%).
We again stress that technical training should be a cornerstone for any organization. There will be many contributing factors shaping what that looks like, though. For example, on the next page, levels of managerial responsibility trend differently for which learning sources are most valuable. Before continuing, we also wanted to give an honorable mention to [some form of] labs or laboratories as a valuable learning source. This question had an open-ended option, and labs was a popular write-in.
The skew between individual contributors and management learning preferences reveals a potential disconnect. While management favors onsite, in-person training, likely to encourage collaboration and maintain in-person culture, individual contributors clearly lean toward online training formats that offer flexibility. This disconnect might highlight differing values—management seeking control and face-to-face interaction, whereas individual contributors prioritize autonomy and adaptability. Or it could be that managers are looking to enhance their “people skills” which might be more effectively done with an in-person format.
While technical training is universally seen as critical, most people expressed that they simply don’t have the time. The gap between intent and action here speaks to the pressure organizations feel to prioritize other tasks over upskilling—especially considering that reliability engineering is, at its core, a deeply technical undertaking. When learning is deprioritized, the ability to adapt to new tools and methodologies suffers, which will become evident as we explore the role of AI.
Regarding time spent on technical learning, there is a differing trend between ranks. This is qualitative, though; we theorize the underlying dimension is simply that managers (1 mgmt level) and directors (2 mgmt levels) may feel they need less tech training to begin with. So we are not surprised to see this.
Training is important…but on what? We asked about AI, but the sentiment around it remains varied. People are understandably hesitant about diving into AI implementations. Notably, the desire for technical training was the second most selected, following only the cautionary sentiment. While AI is undeniably the hot shiny thing right now, are there other crucial areas that warrant your attention? It remains to be seen whether training on AI is the path to mastery or misery.
Analyzing the top two choices from the AI implementations sentiment question by rank reveals alignment on the desire for technical training. That is, the trend line is nearly flat across ranks. We refer to these rare alignments by rank as ‘universal opportunities’, and suggest organizations take note.
However, when it comes to the caution sentiment, the skew between individual contributors (46%) and management levels (30%) is notable. Individual contributors seem to show greater caution, and this differing opinion matches the general pattern of previous SRE Reports.
Similar to the 2024 SRE Report, ‘Writing Code’ saw the highest positive sentiment regarding AI's usefulness (39%). Writing code also emerged as the most common AI use case in the 2024 DORA Report. The optimism around AI for coding is understandable, given the universal need to “write more code”. What remains less clear is whether this reliance will ultimately lead to significant software delivery or quality improvements.
Conversely, ‘Release Management’ garnered the lowest positive sentiment, at 17%, down from 27% last year.
Incidents are not rare, nor are they isolated.
No discussion of reliability would be complete without considering its obverse: the outage. As we saw in the first section of this year’s report, it isn’t just full outages that matter anymore; it’s also performance degradations that cause customers to rage-click, defect to another supplier, hate their online lives, or otherwise be unhappy with your service.
So how are companies faring this year in handling incidents?
40%: the number of respondents who said they have responded to between 1 and 5 incidents in the last 30 days.
Responding to incidents is part of the job. Understanding the frequency and impact of these events is crucial for improving resilience.
That 40% of respondents reported handling between 1 and 5 incidents in the past 30 days may seem manageable, but it’s important to recognize the cumulative effect of incidents on both individual well-being and team effectiveness. For those dealing with 6-10 incidents per month (23%), the load becomes even more challenging, especially when compounded by other responsibilities.
Curiously, 14% of respondents reported dealing with zero incidents in the last month. This may indicate that their systems are highly resilient, that they are not in the response loop for incidents, or, perhaps, that their monitoring and alerting practices haven’t picked up critical signals, which is a potential concern.
We hypothesized that being an individual contributor would translate into handling more incidents, but our findings tell a different story. Higher-level managers are just as involved, if not more, in incidents. Perhaps it’s because managers get called into most calls and postmortems for each of their direct reports.
This reveals an interesting insight into the dynamics of incident response within teams: managers aren't exempt from incidents. In fact, they are deeply involved in both managing and understanding them. It could also indicate that responsibility increases involvement rather than decreasing it.
Every incident, regardless of severity, contributes to the ongoing stress of maintaining reliability. The majority of respondents said they experienced higher levels of stress during incidents. However, we also looked at how many reported higher levels of stress after incidents. Fourteen percent of respondents said stress levels were higher after incidents (versus during incidents). For example, if they indicated seldom getting stressed during incidents, then 14% selected either ‘Sometimes,’ ‘Often,’ or ‘Always’ for increased stress after incidents. This could indicate ongoing repercussions of incidents that were not fully resolved, or perhaps the pressure of learning from failures in an environment lacking blameless post-incident practices.
Support levels tend to be higher during incidents compared to afterward, matching the superficial expectation that once an incident is “resolved,” all the work is done. This highlights a potential gap in post-incident support. When your site (or feature) is obviously broken, it is easy to get people engaged to fix that obvious problem, but often the longer work to identify and address the contributing factors is much harder to explain to people and maintain attention or priority.
This isn’t a diatribe. It is a wonderful opportunity to address the IT-to-business communications gap.
I was recently having a conversation with a VP of Software Engineering. During our banter, he said he was about to do a presentation on the importance and evolution of monitoring, and was going to use some of the data from last year’s report.
It is because of examples like this that the SRE Report will always hold a special place in my heart. I started writing it (along with Kurt Andersen and others) in the first year of COVID, even though Catchpoint had started this annual contribution a couple of years before.
But it’s not only about me. It is about all of our readers who give us feedback and suggestions. It is about our contributors who take time out of their day job to offer their views from the field. It is about all practitioners, leaders, or evangelists who share their stories and are sometimes the inspiration for what we choose to research.
For this, I, and the entire report production team, extend our deepest gratitude.
Thank you.
We can’t do it without you. Looking forward to next year’s research.
The SRE Survey, used to generate insights for this report, was open for six weeks during July and August 2024. The survey received 301 responses from all across the world, and from all types of reliability roles and ranks.
Explicit Congestion Notification (ECN) is a longstanding mechanism in the IP stack that allows the network to help endpoints "foresee" congestion between them. The concept is straightforward: if a nearly congested piece of network equipment, such as a router in the middle of the path, could tell the endpoints, "Hey, I'm almost congested! Can you two slow down your data transmission? Otherwise, I'm worried I will start to lose packets...", then the two endpoints can react in time to avoid the packet loss, paying only the price of a minor slowdown.
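Concretely, the ECN field is just the two least significant bits of the IP ToS/traffic-class byte (RFC 3168); the other six bits carry DSCP. A minimal Python sketch of decoding it:

```python
# Decode the 8-bit ToS byte: DSCP is the top 6 bits, ECN the bottom 2.
ECN_CODEPOINTS = {
    0b00: "Not-ECT (ECN not supported)",
    0b01: "ECT(1) (ECN-capable transport)",
    0b10: "ECT(0) (ECN-capable transport)",
    0b11: "CE (Congestion Experienced)",
}

def parse_tos(tos: int):
    """Split the ToS byte into its DSCP value and ECN codepoint name."""
    return tos >> 2, ECN_CODEPOINTS[tos & 0b11]

print(parse_tos(0b00000010))  # ECT(0): sender advertises ECN capability
print(parse_tos(0b00000011))  # CE: a router has marked congestion
```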
ECN bleaching occurs when a network device at any point between the source and the endpoint clears or “bleaches” the ECN flags. Since you must arrive at your content via a transit provider or peering, it’s important to know if bleaching is occurring and to remove any instances.
With Catchpoint’s Pietrasanta Traceroute, we can send probes with IP-ECN values different from zero to check hop by hop what the IP-ECN value of the probe was when it expired. We may be able to tell you, for instance, that a domain is capable of supporting ECN, but an ISP in between the client and server is bleaching the ECN signal.
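To sketch what that hop-by-hop check amounts to (an illustration, not Catchpoint’s implementation), suppose we already have the ECN value reported for the probe at each hop; the path data below is made up.

```python
# Find the first hop where the probe's ECN field no longer matches what
# was sent. A CE mark (0b11) is legitimate congestion signaling, not bleaching.

def find_bleaching(hops, sent_ecn=0b10):          # probe sent with ECT(0)
    for hop_ip, observed_ecn in hops:
        if observed_ecn not in (sent_ecn, 0b11):  # cleared or rewritten
            return hop_ip
    return None                                   # no bleaching observed

path = [("10.0.0.1", 0b10), ("203.0.113.7", 0b10), ("198.51.100.9", 0b00)]
print(find_bleaching(path))  # -> 198.51.100.9, where ECT(0) was cleared
```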
ECN is an essential requirement for L4S since L4S uses an ECN mechanism to provide early warning of congestion at the bottleneck link by marking a Congestion Experienced (CE) codepoint in the IP header of packets. After receipt of the packets, the receiver echoes the congestion information to the sender via acknowledgement (ACK) packets of the transport protocol. The sender can use the congestion feedback provided by the ECN mechanism to reduce its sending rate and avoid delay at the detected bottleneck.
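As a rough illustration of that sender-side reaction, here is a simplified, DCTCP-style sketch; L4S scalable congestion controls build on this idea, but the constants and structure here are illustrative, not taken from any specification.

```python
GAIN = 1 / 16  # EWMA gain for the estimated marking fraction

def on_ecn_feedback(cwnd: float, alpha: float, acked: int, ce_marked: int):
    """Per-RTT update: smooth the fraction of CE-marked packets, then
    shrink the congestion window in proportion to that fraction."""
    frac = ce_marked / acked if acked else 0.0
    alpha = (1 - GAIN) * alpha + GAIN * frac     # smoothed CE fraction
    if ce_marked:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))  # proportional backoff
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
cwnd, alpha = on_ecn_feedback(cwnd, alpha, acked=50, ce_marked=5)
print(round(cwnd, 2), round(alpha, 4))  # a gentle cut, not a 50% slash
```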
ECN and L4S need to be supported by the client and server, but also by every device within the network path. It only takes one instance of bleaching to remove the benefit of ECN: if any network device between the source and endpoint clears the ECN bits, the sender and receiver won’t find out about the impending congestion. Our measurements examine how often ECN bleaching occurs and where in the network it happens.
ECN has been around for a while, but with the increase in data and the demand for high-quality user experience, particularly for streaming, ECN is vital for L4S to succeed, and major investments are being made by large technology companies worldwide.
L4S aims at reducing packet loss (and hence the latency caused by retransmissions) and at providing as responsive a set of services as possible. In addition, we have seen significant momentum from major companies lately, which always helps push a new protocol toward deployment.
If ECN bleaching is found, this means that any methodology built on top of ECN to detect congestion will not work.
Thus, you cannot rely on the network to achieve what you want to achieve, i.e., avoiding congestion before it occurs: potential congestion is marked with the Congestion Experienced (CE = 3) codepoint when detected, and bleaching would wipe out that information.
The causes behind ECN bleaching are multiple and hard to identify, from network equipment bugs to debatable traffic engineering choices and packet manipulations to human error.
For example, bleaching could occur from mistakes such as overwriting the whole ToS field when dealing with DSCP instead of changing only the DSCP bits (remember that DSCP and ECN together compose the ToS field in the IP header).
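A small sketch of that mistake, assuming direct manipulation of the 8-bit ToS byte:

```python
def set_dscp_buggy(tos: int, dscp: int) -> int:
    """Overwrites the whole ToS byte; the ECN bits are bleached to 00."""
    return dscp << 2

def set_dscp_correct(tos: int, dscp: int) -> int:
    """Rewrites only the top 6 DSCP bits, preserving the ECN field."""
    return (dscp << 2) | (tos & 0b11)

tos = 0b1010_1010                     # some DSCP, with ECT(0) in the low bits
print(bin(set_dscp_buggy(tos, 0)))    # 0b0  -> ECN signal lost
print(bin(set_dscp_correct(tos, 0)))  # 0b10 -> ECN preserved
```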
Nowadays, network operators have a good number of tools to debug ECN bleaching from their end (such as those listed here) – including Catchpoint’s Pietrasanta Traceroute. The large-scale measurement campaign presented here is an example of a worldwide campaign to validate ECN readiness. Individual network operators can run similar measurement campaigns across networks that are important to them (for example, customer or peering networks).
The findings presented here are based on running tests using Catchpoint’s enhanced traceroute, Pietrasanta Traceroute, through the Catchpoint IPM portal to collect data from over 500 nodes located in more than 80 countries all over the world. By running traceroutes on Catchpoint’s global node network, we are able to determine which ISPs, countries and/or specific cities are having issues when passing ECN marked traffic. The results demonstrate the view of ECN bleaching globally from Catchpoint’s unique, partial perspective. To our knowledge, this is one of the first measurement campaigns of its kind.
Beyond the scope of this campaign, Pietrasanta Traceroute can also be used to determine whether there is incipient congestion and/or any other kind of alteration, and to assess the level of support for more accurate ECN feedback, including whether the destination transport layer (either TCP or QUIC) supports it.