Virtually every enterprise today is adopting new digital transformation strategies to deliver value to its customers. The enterprise increasingly relies on digital to interact, communicate, sell to, and service its consumers, and increase the efficiency of business operations. Meanwhile, IT is adopting agile development processes to build and deliver digital services. This means businesses are increasingly dependent on cloud providers and other third-party vendors for key components that are critical to the functioning and delivery of their services.
The concept of digital transformation has become a catch-all term and can mean different things to different companies. CIO recently interviewed a number of businesses about their digital vision. The CIOs defined digital in a variety of ways, including:
“Digital is all about how we reach the customer. Traditionally, Western Union has been known as a cash business; our goal now is to digitize the customer experience wherever we can. We measure the success of our digital transformation by tracking the channels our customers choose to use: web, phone, or an agent location. Our transition to digital is all about giving the customer convenience, simplicity, and options.” - Sheri Rhodes, formerly CTO of Western Union, now CIO at Workday
“For Lenovo, digital ranges from basic process optimization, to using technology to unlock new business models, to creating new products, and delivering more empowering customer experiences.” Hu clarified that his company was increasingly focused on AI and ‘Intelligent Transformation’, which he defined as “applying technology, especially AI, in all of the areas of the business to tap into the exploding amounts of data that are becoming more available throughout the enterprise and ecosystem.” - Arthur Hu, CIO, Lenovo
The number of companies providing a digital service to enterprises has skyrocketed in the last decade, with the widespread shift to cloud services. An ever-increasing reliance on cloud providers to deliver digital services is dramatically changing the role of enterprise IT beyond architecting, developing, and monitoring performance, to adopting and governing the cloud.
Enterprises have to figure out which infrastructure types best suit their individual workloads and what the overall profile of infrastructure types should look like. The good news is that the wide range of cloud providers and third-party services available enables businesses to determine the best infrastructure for their specific use cases.
Private cloud solutions that offer dedicated resources are growing the fastest in share of compute resources, suggesting that the adoption of Infrastructure as a Service (IaaS) and Software as a Service (SaaS) resources will continue to grow in the future.
For years, procurement and legal departments have ensured that contracts with service providers contain strict clauses on Service Level Agreements (SLAs), including any penalties that will be incurred if SLAs are broken. As digital becomes an ever-larger component of everyday business, SLAs play an increasingly important role in maintaining good service and governing the cloud.
Properly set and enforced SLAs provide the company consuming the service (the customer) with objective grading criteria and protection from the business impact of poor service. The service provider, meanwhile, gains from the opportunity to set appropriate expectations for how its service will be judged and, because it is being held accountable, is incentivized to improve quality of service.
Numerous major service providers have experienced significant outages, demonstrating the importance of having robust SLAs in place. These include:
Having an SLA signed and on file alone is not sufficient protection. The customer must ensure the provider is meeting its SLA, and the provider must ensure it does not breach the SLA.
Additionally, almost all SLAs require the customer of the service to file a breach request to trigger any penalties defined therein. In other words, a service provider will not provide you the credits that the SLA states you are owed unless you specifically request and provide proof of a problem.
When Microsoft 365 had a two-hour outage in April 2019, Microsoft breached its SLA with Catchpoint. Even though the outage was publicly known and verified, Catchpoint still had to file an SLA request, which a special Microsoft team then validated and verified before ultimately agreeing to issue a credit to our account.
When we speak to companies before they deploy our Internet Performance Monitoring (IPM) platform, we frequently discover that many of them have failed to collect any penalties from their vendors in the event of a breach. This leads to finger pointing and a deterioration of the relationship.
SLA monitoring is an often-overlooked aspect of properly managing SLAs. Typically, a Service Level Agreement is based on monthly data. Therefore, monitoring the agreement on a monthly basis is essential for proactive detection of breaches.
Monitoring your service providers and using the right enforcement strategy can:
This guide focuses on what customers of cloud services can do to measure the performance and manage the SLAs of their various cloud providers, and what traps and pitfalls to avoid along the way.
The term SLA is widely used and has thereby become an umbrella term. A company cannot actually monitor an SLA, which is really just a document that outlines the terms of service. A business can only enforce an SLA by monitoring the metrics referenced within it.
A metric referenced by the SLA, which is a quantitative measure of the level of service, is called a Service Level Indicator (SLI). The SLI is measured against a Service Level Objective (SLO), a target that should not be breached. The SLO provides a value or range of values considered acceptable for the SLI, generally expressed as an upper- or lower-bound limit. For example, DNS resolution time should not exceed 100 ms.
Almost all SLAs for digital services include an availability metric as an SLI. Availability is the proportion of time within a given period, typically the calendar month, for which the service was reachable and functioning as expected. The availability metric therefore measures both the reachability of the service from outside the provider’s own infrastructure and whether the code behind the service performs its function as expected.
If you are using Twilio as an SMS delivery service, for instance, the availability metric would measure whether Twilio’s API is reachable and whether it properly responded to the API request to send an SMS, in addition to whether the SMS was actually delivered.
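To make that concrete, here is a minimal sketch, in Python, of how an availability SLI might be computed from synthetic check results. The check format, the per-minute cadence mentioned in the comments, and the sample data are illustrative assumptions, not any provider’s actual reporting.

```python
# Sketch: computing monthly availability from synthetic check results.
# Each check records whether the service was reachable and whether it
# performed its function (e.g., the API accepted the send-SMS request).
# The data structure is an illustrative assumption.

def availability(checks):
    """Percentage of checks in the period that were both reachable and functional."""
    good = sum(1 for c in checks if c["reachable"] and c["functional"])
    return 100.0 * good / len(checks)

# One check per minute over a 30-day month would be 43,200 samples;
# here we use a tiny hypothetical sample for readability.
checks = [
    {"reachable": True,  "functional": True},
    {"reachable": True,  "functional": False},   # API up but the request failed
    {"reachable": False, "functional": False},   # outage
    {"reachable": True,  "functional": True},
]
print(f"{availability(checks):.2f}% availability")   # 50.00% for this toy sample
```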
Any SLA will outline how the SLI will be measured, the length of time or number of measurements that must be outside the range, and if there are any consequences if the agreement is breached. If there are no consequences, there is no SLA.
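Many SLAs attach a duration or count condition to the SLO, for example that the indicator must be out of range for several consecutive measurements before a breach is declared. The sketch below illustrates one such rule, reusing the 100 ms DNS-resolution objective from the earlier example; the three-consecutive-measurements condition and the sample data are hypothetical.

```python
# Sketch of a breach rule of the kind an SLA might specify, e.g. "the SLO is
# breached if the SLI is out of range for three or more consecutive measurements."

def breached(sli_samples_ms, slo_ms=100, consecutive_required=3):
    """Return True if the SLI exceeds the SLO for enough consecutive samples."""
    streak = 0
    for sample in sli_samples_ms:
        streak = streak + 1 if sample > slo_ms else 0
        if streak >= consecutive_required:
            return True
    return False

print(breached([40, 120, 50, 60]))            # False: a single excursion
print(breached([40, 120, 130, 150, 60]))      # True: three consecutive violations
```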
Learn more about the difference between SLAs, SLOs, and SLIs in our on-demand webinar, "Solving the SLO Riddle: Why SLOs aren't enough" (watch free, no registration required).
As mentioned earlier, an SLA exists between the customer and the provider of services.
The provider can be an external or internal provider, for example, another group or division within the company. External providers include DNS services, content delivery networks (CDNs), managed service providers, cloud compute, hosting providers, validating services, translation services, API platforms, DDoS protection, Bot Protection, Identity and Access Management (IAM), productivity software, communication and collaboration suites, and more.
Internal SLAs are no less important than external ones. In large enterprises, different groups have different budgets and priorities. However, there are often dependencies on services, applications, or infrastructure that can result in one division impacting another, causing a domino effect of outages and a negative impact on the business. Internal SLAs help the various enterprise groups hold one another accountable, and ensure that the quality of service delivery takes priority without finger pointing and political infighting.
It doesn’t matter whether the provider is internal or external; if the application or service being provided is business critical, an SLO and SLA should be set. But setting an SLO and publishing an agreement is not enough; you also need to accurately measure the service to ensure that the SLOs are being met.
The overall goal of the SLA is to improve the accountability of the provider and ensure that the customer paying for the service can mitigate any risks involved by shifting service to a different provider when necessary.
To improve accountability, it is critical that the right data on the service is collected. It must accurately represent the expectations of the SLA and what the service provider has within its control. Without accurate data, the relationship can quickly turn into a finger-pointing exercise in which each party feels that it is right: “Yes, the performance wasn’t what was expected; the customer is right.” “No, the SLA wasn’t breached; the provider is right.” Looking at things from a neutral standpoint, however, often reveals that there is no single right answer.
Secondly, the provider can exercise control over the quality of what is delivered only up to a certain point on the Internet. A provider of a digital service cannot be responsible for what is happening outside the realm of its control.
For example, a major fiber cut in Virginia might impact the Internet for most of the Northeast; a content delivery network (CDN) provider will be affected, but it can do little about the cut, and the outage is not its responsibility. Therefore, the first step in bringing accountability and governance to SLAs is to collect objective and accurate measurements.
Some service providers might issue reports on how they are performing in relation to SLAs, but verifying this with your own monitoring will increase confidence in the level of performance being received. It ensures trust on both sides, and means that you won’t feel cheated if and when your organization experiences business pain while your provider’s data appears to show that everything was normal. Where the metrics are monitored from matters.
Enterprises started to monitor SLA metrics in the late 1990s. At the time, businesses focused on monitoring websites, ad-serving systems, or HTTP URLs for their primary providers: hosting companies, ISPs, and CDNs.
By contrast, as we have discussed, the typical enterprise today relies much more heavily on external service providers, which deliver not only web pages and static image files but also DNS, APIs, and other important client-server protocols.
Websites and applications have also become much more complex with various portions of the end-user flow being impacted by an array of vendors. User login and authentication might be provided by an Identity and Access Management (IAM) provider, while DNS is managed by a DNS provider; a CDN might be involved in accelerating the entire web application; a user’s address might be validated by an API; credit card processing will be handled by one vendor while emailed reports are delivered by someone else.
In other words, many different vendors can take responsibility for different portions of the workflow. To complicate things further, each vendor might rely on others for specific portions of their services, creating an intertwined web of dependencies.
Not all monitoring necessarily enables oversight of all these vendors. Just because a specific service is being monitored with an APM solution does not mean it is reachable, working, or that the indicator measured by the SLA has not been breached. This is because services are delivered by a highly complex architecture that involves multiple layers of providers, applications, infrastructure, and networks.
A simple third-party service used by an application to validate and correct the user’s mailing address, for instance, might rely on Oracle for DNS, Akamai for the CDN and security protection, Amazon AWS for infrastructure, Amazon S3 for storing data, code written by the service provider itself, and so on. Each of these components tends to be highly distributed geographically, with different transit ISPs and routing policies, and there might be hundreds of virtual machines on which the application runs.
In order to successfully manage the SLAs of multiple vendors, a company must:
In a recent Gartner report on SaaS SLAs, the research firm highlighted the fact that many SLAs are not “sufficiently broad or robust.” Gartner noted the key challenges of negotiating a robust SLA that covers risk as broadly as possible, including the failure to specify planned downtime exclusions and limitations, thereby putting the customer at risk of unplanned, and potentially unacceptable, outages at difficult times.
Gartner’s recommendations included the need to:
Sadly, most enterprises do not have a full understanding and/or visibility of the complexity of the relationships between their different service providers, and what this means to them. APM platforms offer visibility only once a major provider has a major outage and don’t really help with monitoring SLAs.
One such event occurred in the fall of 2016, when the DNS provider Dyn (now an Oracle company) had a major outage that took down large portions of the Internet. When the Dyn outage happened, it impacted not just Dyn’s direct customers, but also the companies who relied on providers who in turn relied on Dyn.
In other words, it was a domino effect.
Most companies were not properly monitoring the full extent of their services, and only became aware of the problem after users complained that portions of their sites or web applications were not working properly.
These companies had no understanding, no visibility, and worst of all, no plan in place as to what to do when such an event took place. This widespread event gave birth to the “multi-DNS” strategy that most IT organizations have since put in place to ensure that DNS can no longer take them down.
Furthermore, an enterprise must ensure that its monitoring strategy covers the service level indicators involved in key transactions of its application or service, since each individual service involved in a transaction is capable of breaking it.
Monitoring transactions is more complicated than single page monitoring when determining which parameters should be used. Different inputs may yield different results based on application logic or the APIs used to pull information. Using the same inputs may not provide a sufficient level of insight.
It isn’t feasible to test every permutation, but simply checking that the page is up is not sufficient either; validating that specific inputs produce specific outputs is what matters.
Simply checking that the login page is accessible, for example, doesn’t mean that the user is actually able to log in. You must enter a username and password, ensure the user is authenticated, and confirm that they can reach a page with the correct data: for example, a particular dashboard with a specific set of numbers and charts.
Catchpoint enables the monitoring of business and end-user workflows. It supports multi-step transactions using two popular open-source frameworks: Playwright and Puppeteer. This makes it easy to create transactions that can be used to detect any issues with key business processes. Logic can be inserted into scripts to choose different search terms, select valid travel dates, or click the second item in a list, letting you dynamically cycle through a relevant set of terms, dates, and/or numbers.
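As an illustration of what such a scripted transaction can look like, here is a minimal login flow written against the open-source Playwright Python API. The URL, selectors, credentials, and expected dashboard content are hypothetical placeholders, and a transaction authored in Catchpoint may be structured differently.

```python
# Sketch of a multi-step login transaction using the Playwright Python API.
# The URL, selectors, and expected dashboard content are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Step 1: load the login page (reachability alone is not enough).
    page.goto("https://app.example.com/login")

    # Step 2: authenticate with a dedicated monitoring account.
    page.fill("#username", "synthetic-monitor@example.com")
    page.fill("#password", "********")
    page.click("button[type=submit]")

    # Step 3: validate that specific inputs produce the expected output,
    # e.g. the dashboard renders with the revenue widget populated.
    page.wait_for_selector("#dashboard", timeout=15_000)
    widget_text = page.inner_text("#revenue-widget")
    assert widget_text.strip(), "Dashboard loaded but the revenue widget is empty"

    browser.close()
```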
Additionally, Catchpoint offers not only monitoring of web transactions through a native Chrome browser that emulates that of end users, but also a range of other services, from API transactions to DNS, from email protocols to WebSocket, from MQTT to network tests, and even allows for the building of customized monitoring for services unique to the individual enterprise.
One of the biggest traps related to monitoring metrics for SLAs is where your monitoring agents are located. In the early days of SLA management, the service vendors figured out that if they monitored their services from within their own datacenters, it was to their benefit.
However, buyers quickly learned there were mismatches between their vendor’s SLA metrics and what the users of their companies were experiencing. In the early days of SLAs and Internet-based services, there were frequently cases in which the vendor’s system operator would say everything was green (i.e., operating fine) while the technical support lines were ringing non-stop with furious customers who were experiencing outages.
Monitoring data 24/7 from a probe deployed in the same datacenter as the service might work for internal SLAs where the customers of the service are within the datacenter. However, this won’t work for services consumed externally over the Internet.
A service on the Internet is more than just code on many containers on many physical servers. There are other components of the service architecture that are in the realm of control of the provider, which can cause serious performance issues, outages, and/or unreliable service transactions.
The key components that are always present are a datacenter’s geographical location, transit providers used in the provider’s datacenter, and routers, load balancers, and firewalls.
Lastly, the service provider itself will inevitably rely on third-party services in its architecture mesh, which might be unknown to the service consumer, such as CDN, site acceleration, DDoS protection, page optimization, image optimization, cloud computing services (Lambda functions, S3 storage, etc.), and other services hosted by the provider in other datacenters, and so on.
The service provider ultimately controls which vendors are used and applies its own strategies to managing these services and enforcing its own SLAs. Considering that a modern service relies on so many components external to the datacenter, it is clear that measuring solely from within the datacenter is not reliable enough for SLA management.
At the same time, monitoring each end user through Real User Monitoring (RUM) or similar means will not work in most cases. At the edge of the Internet where the end users are, there are many components outside the control of the service provider and even that of the buyer.
Today, an end user (customer or employee) might use a laptop and connect to a coffee shop WiFi, going through a Consumer ISP and an Internet proxy service.
Any one of these components can introduce an outage or poor performance from an end-user perspective, but the provider cannot be held accountable for their failure. At the same time, Real User Monitoring will only have data available if the user is able to successfully access the service. Therefore, you won’t have a constant stream of data available 24/7 to properly ascertain if an SLA has been breached. Moreover, when an outage occurs, you will have no data at all.
Since we cannot monitor from within the datacenter or from the end user, the only place left is somewhere in between the provider’s datacenter and the end user where the provider can actually be held accountable.
Monitoring from a single location isn’t effective given the varied geographical and ISP distribution. You need to measure from a range of vantage points that match where your users are geographically located to ensure visibility of network issues that are in the hands of the provider. The wider the range of vantage points, the easier it is to see whether issues are regional or global, and to have enough coverage for the distribution of the service, which could span multiple datacenters and rely on multiple transit providers.
At the same time, geographic diversification is not enough. Different geographies rely on different transit providers that are key to the delivery of Internet data. Any service provider needs to ensure its network is accessible from all the key backbone locations that its communications will traverse.
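To illustrate the regional-versus-global determination described above, here is a small sketch that aggregates pass/fail results by the region of the vantage point. The input format and the 50% failure threshold are assumptions made for the example.

```python
# Sketch: classifying an incident as regional or global from multi-vantage-point data.
from collections import defaultdict

def classify_incident(results, failure_threshold=0.5):
    """results: list of dicts like {"region": "us-east", "ok": False}."""
    per_region = defaultdict(lambda: [0, 0])          # region -> [failures, total]
    for r in results:
        per_region[r["region"]][1] += 1
        if not r["ok"]:
            per_region[r["region"]][0] += 1

    failing = {region for region, (fail, total) in per_region.items()
               if fail / total >= failure_threshold}

    if not failing:
        return "healthy"
    if failing == set(per_region):
        return "global issue"
    return f"regional issue: {sorted(failing)}"

samples = [
    {"region": "us-east", "ok": False}, {"region": "us-east", "ok": False},
    {"region": "eu-west", "ok": True},  {"region": "eu-west", "ok": True},
    {"region": "apac",    "ok": True},
]
print(classify_incident(samples))   # regional issue: ['us-east']
```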
One of the latest monitoring solutions offered by APM vendors is synthetics from the cloud. Such monitoring solutions offer a cost-effective way to monitor issues caused by your code. It is okay to measure from the cloud if that is where your application is hosted, but that is still only one piece of the puzzle.
Synthetics from the cloud will continue to fall short with regard to SLA enforcement for the following reasons:
One of the most common traps in SLA management is that companies negotiate each vendor SLA separately, and the negotiation is too often not conducted by those who are actually in charge of the design and architecture of the service. This results in the company not looking at what the SLA of a given vendor means to the overall service SLA.
To illustrate the problem, let’s look in more detail at an example we briefly touched on earlier: a company has built a web-based application for selling data intelligence; its customers can purchase datasets as needed.
The service relies on:
The company has a practice of setting all their availability SLAs to 99.9%, or 43m 49s of downtime per month. They decide to add a new vendor for DDoS protection, which sits between the CDN and the cloud provider; all transactions are shielded by the DDoS protection service. The team assumes they can negotiate an SLA of 99.9% as they have done previously.
However, they do not take into account that they themselves have SLAs with customers and partners of their own. They mistakenly assume that they are covered and will not fall below 99.9% availability themselves. In reality, however, their application’s own 99.9% SLA would be breached if more than one vendor breaches.
Their highest risk is if all of them breach, which would mean:
To avoid breaching their own SLAs with customers, or simply to reduce downtime for their users, a company must figure out how to reduce the risks beyond what the SLAs with each vendor allow.
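As a back-of-the-envelope illustration of that risk, the sketch below multiplies the availabilities of several serial dependencies. The vendor list is hypothetical, and the calculation assumes each vendor uses its full downtime allowance independently, which is the worst case for the customer: four vendors at 99.9% each leave a composite floor of roughly 99.6%, or close to three hours of potential downtime per month instead of 43m 49s.

```python
# Back-of-the-envelope sketch: worst-case composite availability when several
# vendors sit in series on the critical path. The vendor list is hypothetical.

MINUTES_PER_MONTH = 30.44 * 24 * 60      # ~43,834 minutes in an average month

vendor_slas = {
    "DNS provider": 0.999,
    "CDN": 0.999,
    "DDoS protection": 0.999,
    "Cloud provider": 0.999,
}

composite = 1.0
for sla in vendor_slas.values():
    composite *= sla                      # serial dependencies multiply

downtime_minutes = (1 - composite) * MINUTES_PER_MONTH
print(f"Composite availability floor: {composite:.4%}")                  # ~99.60%
print(f"Potential downtime per month: {downtime_minutes:.0f} minutes")   # ~175 min
```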
SLA management is not just about holding vendors accountable. It is also about your IT organization ensuring reliable services independent of any vendor failure. This responsibility increasingly lies in the hands of Site Reliability Engineers (SREs), a discipline first introduced by Google, which has become significantly more prominent among sophisticated IT organizations in recent years.
For the last several years, we have been surveying SREs to understand more about this emerging role and any outages, incidents, and post-incident stress they encounter in their work.
In our 2019 SRE Report, 49% of respondents stated they had been involved in incident resolution within the last week alone. When vendor failure happens, your organization should already have a “vendor reliability strategy” in place: either a multi-vendor reliability strategy or, depending on the service, a backup reliability strategy.
Either way, you must have a monitoring strategy that supports your SLA strategy, and a reliability strategy that involves real-time monitoring and alerting of both the vendor’s services and your own.
In a multi-vendor reliability strategy, your organization ensures it has at least two vendors in an active-active mode all the time for the same service. This strategy allows IT organizations to architect their applications and services in such a way that the redundant vendors make up for any failures of their counterpart.
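As a rough sketch of the active-active idea (not a prescription or a Catchpoint feature), the snippet below weights traffic across two hypothetical CDN hostnames and drops an unhealthy vendor from rotation based on an external health feed.

```python
# Sketch: active-active selection across two CDN vendors. Both serve traffic
# at all times; an unhealthy vendor is temporarily taken out of rotation.
# Vendor names, weights, and the health data are illustrative assumptions.
import random

VENDORS = {
    "cdn-a.example.com": 0.5,   # share of traffic while healthy
    "cdn-b.example.com": 0.5,
}

def pick_vendor(healthy):
    """healthy: dict vendor -> bool, fed by an external monitoring source."""
    candidates = {v: w for v, w in VENDORS.items() if healthy.get(v, False)}
    if not candidates:
        raise RuntimeError("No healthy CDN vendor available")
    vendors, weights = zip(*candidates.items())
    return random.choices(vendors, weights=weights, k=1)[0]

print(pick_vendor({"cdn-a.example.com": True, "cdn-b.example.com": True}))
print(pick_vendor({"cdn-a.example.com": False, "cdn-b.example.com": True}))  # always cdn-b
```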
There are four key services in which this strategy has been successfully implemented:
There are several challenges with implementing such a strategy, however. Your team needs to take these into account before going forward with this kind of approach:
A backup reliability strategy, by contrast, means that you have a backup plan in place for each key service which is outsourced to external vendors. Such a strategy is simpler and more cost effective, but you will still likely experience some impact from outages.
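A minimal sketch of the backup (active-passive) approach follows, assuming a hypothetical health endpoint per vendor and a threshold of three consecutive failed probes before failing over; a real implementation would typically act by repointing DNS or updating configuration rather than flipping a variable.

```python
# Sketch: a health-check-driven backup (active-passive) failover decision.
# Provider URLs, the probe function, and the failure threshold are illustrative.
import urllib.request

PRIMARY = "https://cdn-primary.example.com/health"
BACKUP = "https://cdn-backup.example.com/health"

def probe(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

class BackupFailover:
    """Fail over to the backup vendor after N consecutive failed health checks."""
    def __init__(self, primary, backup, threshold=3):
        self.primary, self.backup, self.threshold = primary, backup, threshold
        self.active = primary
        self.failures = 0

    def run_check(self):
        if probe(self.active):
            self.failures = 0
            return self.active
        self.failures += 1
        if self.active == self.primary and self.failures >= self.threshold:
            self.active = self.backup   # e.g. repoint DNS to the backup CDN
            self.failures = 0
        return self.active

failover = BackupFailover(PRIMARY, BACKUP)
current_vendor = failover.run_check()   # called on every monitoring cycle
```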
The backup strategy might be used in any one of these cases:
In today’s cloud era, no company can implement a successful digital service without a rock-solid monitoring strategy and an SLA enforcement policy to support it. It would be like building and running a business without having a dashboard of key performance indicators on how the business is operating.
Your monitoring strategy must not only support the ability to monitor your vendors’ SLAs, but also support your team in the implementation and management of any reliability strategies in place for digital services.
Your monitoring strategy must include the following capabilities:
The rise of digital services and the digital transformation movement in the enterprise typically relies on the use of cloud-based services provided by external vendors. This has increased the risk of poor user experience and a negative impact on the business, which in turn has pushed organizations toward the robust use of Service Level Agreements.
However, many companies today continue to lack proper Service Level Management practices. As a result, they are hit by adverse consequences when a breach occurs because they don’t possess the means to hold their providers to account. The unfortunate outcome is that they are losing valuable customers.
Successful IT organizations have also reacted to potential vendor challenges by implementing multi-vendor or backup reliability strategies.
All these approaches, however, can only be successful alongside a broad and robust digital service monitoring strategy, which allows companies to monitor their internal and external vendors to both hold them accountable and mitigate any outages in real time.
Explicit Congestion Notification (ECN) is a longstanding mechanism in the IP stack that allows the network to help endpoints “foresee” congestion between them. The concept is straightforward: if a nearly congested piece of network equipment, such as a router along the path, could tell the endpoints, “Hey, I’m almost congested! Can you two slow down your data transmission? Otherwise, I’m worried I will start to lose packets...”, then the two endpoints can react in time to avoid the packet loss, paying only the price of a minor slowdown.
ECN bleaching occurs when a network device at any point between the source and the endpoint clears or “bleaches” the ECN flags. Since you must arrive at your content via a transit provider or peering, it’s important to know if bleaching is occurring and to remove any instances.
With Catchpoint’s Pietrasanta Traceroute, we can send probes with IP-ECN values different from zero to check hop by hop what the IP-ECN value of the probe was when it expired. We may be able to tell you, for instance, that a domain is capable of supporting ECN, but an ISP in between the client and server is bleaching the ECN signal.
ECN is an essential requirement for L4S since L4S uses an ECN mechanism to provide early warning of congestion at the bottleneck link by marking a Congestion Experienced (CE) codepoint in the IP header of packets. After receipt of the packets, the receiver echoes the congestion information to the sender via acknowledgement (ACK) packets of the transport protocol. The sender can use the congestion feedback provided by the ECN mechanism to reduce its sending rate and avoid delay at the detected bottleneck.
ECN and L4S need to be supported by the client and server but also by every device within the network path. It only takes one instance of bleaching to remove the benefit of ECN since if any network device between the source and endpoint clears the ECN bits, the sender and receiver won’t find out about the impending congestion. Our measurements examine how often ECN bleaching occurs and where in the network it happens.
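The hop-by-hop logic can be sketched as follows. The per-hop ECN values below are illustrative and do not reflect Pietrasanta Traceroute’s actual output format; they simply show how the first bleaching hop can be identified when probes leave the source with a non-zero ECN codepoint.

```python
# Sketch of the hop-by-hop bleaching check: probes leave the source with a
# non-zero IP-ECN value (ECT(0) = 0b10) and we inspect the ECN bits reported
# back at each hop. The hop data below is illustrative.

ECT0 = 0b10   # ECN-Capable Transport codepoint set on outgoing probes
CE = 0b11     # Congestion Experienced

def find_bleaching_hop(hops, sent_ecn=ECT0):
    """hops: list of (hop_number, hop_name, observed_ecn). Return the first hop
    where the ECN bits were cleared or rewritten, or None if the path is clean."""
    for hop_number, hop_name, observed_ecn in hops:
        if observed_ecn == CE:
            continue                 # marked by a congested hop, not bleached
        if observed_ecn != sent_ecn:
            return hop_number, hop_name
    return None

path = [
    (1, "customer-gateway", 0b10),
    (2, "isp-edge-router",  0b10),
    (3, "isp-core-router",  0b00),   # ECN bits cleared: bleaching starts here
    (4, "destination",      0b00),
]
print(find_bleaching_hop(path))      # (3, 'isp-core-router')
```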
ECN has been around for a while, but with the growth in data volumes and the demand for a high-quality user experience, particularly for streaming, ECN is now vital for L4S to succeed, and major investments are being made by large technology companies worldwide.
L4S aims to reduce packet loss (and hence the latency caused by retransmissions) and to provide as responsive a set of services as possible. In addition, we have seen significant momentum from major companies lately, which always helps push a new protocol toward deployment.
If ECN bleaching is found, this means that any methodology built on top of ECN to detect congestion will not work.
You are therefore unable to rely on the network to achieve what ECN is meant to achieve, i.e., avoiding congestion before it occurs: incipient congestion is marked with the Congestion Experienced (CE = 3) codepoint when detected, and bleaching wipes out that information.
The causes of ECN bleaching are numerous and hard to identify, ranging from network equipment bugs to debatable traffic engineering choices, packet manipulations, and human error.
For example, bleaching could occur from mistakes such as overwriting the whole ToS field when dealing with DSCP instead of changing only DSCP (remember that DSCP and ECN together compose the ToS field in the IP header).
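A small sketch of that byte layout shows how the mistake happens: DSCP occupies the upper six bits of the former ToS byte and ECN the lower two, so rewriting the whole byte to set DSCP silently zeroes the ECN bits. The DSCP values used here are arbitrary examples.

```python
# The old IPv4 ToS byte is now DSCP (upper 6 bits) plus ECN (lower 2 bits).
# Rewriting the whole byte, instead of only the top 6 bits, bleaches ECN.

def split_tos(tos):
    return tos >> 2, tos & 0b11          # (dscp, ecn)

tos = (46 << 2) | 0b10                   # DSCP EF (46) with ECT(0) set
print(split_tos(tos))                    # (46, 2)

# Careless remarking: overwrite the entire byte with the new DSCP value,
# forgetting to preserve the low two bits.
bleached = 34 << 2                       # remark to DSCP AF41 (34), ECN zeroed
print(split_tos(bleached))               # (34, 0)  <- ECN bits bleached

# Correct remarking preserves the ECN bits:
correct = (34 << 2) | (tos & 0b11)
print(split_tos(correct))                # (34, 2)
```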
Nowadays, network operators have a good number of tools to debug ECN bleaching from their end (such as those listed here) – including Catchpoint’s Pietrasanta Traceroute. The large-scale measurement campaign presented here is an example of a worldwide campaign to validate ECN readiness. Individual network operators can run similar measurement campaigns across networks that are important to them (for example, customer or peering networks).
The findings presented here are based on running tests using Catchpoint’s enhanced traceroute, Pietrasanta Traceroute, through the Catchpoint IPM portal to collect data from over 500 nodes located in more than 80 countries all over the world. By running traceroutes on Catchpoint’s global node network, we are able to determine which ISPs, countries and/or specific cities are having issues when passing ECN marked traffic. The results demonstrate the view of ECN bleaching globally from Catchpoint’s unique, partial perspective. To our knowledge, this is one of the first measurement campaigns of its kind.
Beyond the scope of this campaign, Pietrasanta Traceroute can also be used to determine whether there is incipient congestion and/or any other kind of alteration, as well as the level of support for more accurate ECN feedback, including whether the destination transport layer (either TCP or QUIC) supports it.