In today’s always-on digital world, the availability, performance, and resilience of your Internet-based services, such as websites, applications, APIs, and mobile apps, can make or break business success. For a CIO or IT leader overseeing Site Reliability Engineers (SREs), Network Engineers, and Development Teams, ensuring that these services operate seamlessly is paramount.
This buyer’s guide simplifies the process of choosing the right Internet Performance Monitoring (IPM) platform by:
Internet-based applications rely on a complex stack of technologies, systems, and services referred to collectively as the Internet Stack. It includes multiple layers, from the physical network infrastructure to application-level protocols and services. The layers are interdependent, so when something goes wrong in one, the effects can ripple across the entire system, affecting performance and availability.
The Internet Stack comprises these critical layers:
The complexity of the Internet Stack necessitates a comprehensive monitoring strategy that provides visibility into each layer and real-world application experience from the user's perspective. This approach helps you identify potential performance issues early, enabling swift resolution.
Traditional Application Performance Monitoring (APM) solutions fall short in this regard. Here’s why APM is no longer enough for today’s complex Internet Stack.
APM emerged roughly 30 years ago for monolithic, three-tier applications running on local, on-premises infrastructure without virtualization, where infrastructure lacked elasticity. While APM offers good insight into application logs, traces, and infrastructure metrics, it faces several limitations and challenges.
Due to these limitations, APM tools are often insufficient for ensuring the performance and availability of internet-based services. This gap is where IPM platforms come into play.
To ensure a robust and reliable IPM platform, prioritize the following features:
Many organizations default to monitoring only the systems within their direct control. However, this approach overlooks a crucial aspect: it doesn’t reflect the user experience. An effective IPM platform should provide insights into real user experiences, helping IT teams identify and address performance bottlenecks.
A strong IPM platform should offer:
Monitoring digital experiences from the user's perspective requires a global network of intelligent agents. Having these agents in the right locations can shed light on crucial questions during an outage or performance degradation:
Ensure the platform offers:
An effective IPM platform identifies and resolves issues before they impact users.
Key features include:
Understanding how data travels across the Internet is key to spotting bottlenecks and disruptions.
An effective IPM platform should provide detailed visibility into the network path, including:
A comprehensive IPM platform should collect a wide range of performance metrics across all layers of the Internet Stack, offering a complete view of service performance.
Key metrics to look for:
An effective IPM platform offers real-time monitoring to detect issues instantly and historical data for trend analysis and optimization.
Look for:
A good IPM platform helps IT teams visualize and share insights with stakeholders through:
As organizations grow and their digital footprint expands, an effective IPM platform must be scalable and flexible to meet evolving needs.
Key features include:
Effective SLO monitoring is critical for maintaining high performance standards and accountability across internal teams and vendors.
An IPM platform should support:
APM tools often fall short, missing critical user-impacting issues and lacking the data needed for accurate SLO and XLO measurement. Synthetic monitoring from AWS may confirm a vendor’s service is working within AWS, but this narrow approach fails to provide the transparency and accountability required for today’s complex Internet environment.
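For context, the arithmetic behind SLO measurement is simple once the underlying check data is trustworthy; here is a minimal, illustrative error-budget sketch in Python (the function name, target, and numbers are ours, not from this guide):

```python
# Illustrative only: computing an error budget from synthetic check results
# for an SLO such as "99.9% of checks succeed over a 30-day window".
def error_budget_remaining(slo_target: float, total_checks: int, failed_checks: int) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    allowed_failures = (1 - slo_target) * total_checks
    return 1 - failed_checks / allowed_failures if allowed_failures else 0.0

# 30 days of 1-minute checks (43,200) with 60 failures against a 99.9% target:
print(error_budget_remaining(0.999, 43_200, 60))  # ~ -0.39: budget overspent
```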
Finding the right platform to ensure the availability, performance, and resilience of Internet-based services is no small task. With so many criteria to consider—spanning global coverage, real-time insights, and advanced analytics—it can be difficult to determine which solution truly meets your organization’s needs.
Industry benchmarks, such as Gartner’s Magic Quadrant for Digital Experience Monitoring (DEM), can help guide your decision. Recognizing leaders and their focus areas is a great way to identify the most effective solutions. Catchpoint, named a Leader in Gartner’s first Magic Quadrant for DEM, exemplifies excellence in bridging this gap.
Explicit Congestion Notification (ECN) is a longstanding mechanism in the IP stack that allows the network to help endpoints "foresee" congestion between them. The concept is straightforward: if a nearly congested piece of network equipment, such as an intermediate router, could tell the endpoints, "Hey, I'm almost congested! Can you two slow down your data transmission? Otherwise, I’m worried I will start to lose packets...", then the two endpoints could react in time to avoid the packet loss, paying only the price of a minor slowdown.
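To make the signal concrete, here is a minimal Python sketch of the two-bit ECN field carried in the IP ToS/Traffic Class byte, per RFC 3168; the constant and function names are ours, purely for illustration:

```python
# The two least-significant bits of the IPv4 ToS / IPv6 Traffic Class byte carry ECN.
NOT_ECT = 0b00  # sender does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport (also used as the L4S identifier)
ECT_0   = 0b10  # ECN-Capable Transport
CE      = 0b11  # Congestion Experienced, set by a congested router

def ecn_codepoint(tos_byte: int) -> int:
    """Extract the 2-bit ECN field from a ToS/Traffic Class byte."""
    return tos_byte & 0b11

def is_congestion_experienced(tos_byte: int) -> bool:
    """True if a router along the path marked this packet as CE."""
    return ecn_codepoint(tos_byte) == CE
```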
ECN bleaching occurs when a network device at any point between the source and the endpoint clears, or "bleaches," the ECN flags. Since your traffic must reach your content via a transit provider or peering, it’s important to know whether bleaching is occurring and to remove any instances of it.
With Catchpoint’s Pietrasanta Traceroute, we can send probes with non-zero IP-ECN values and check, hop by hop, what the IP-ECN value of each probe was when it expired. We may be able to tell you, for instance, that a domain is capable of supporting ECN, but that an ISP between the client and server is bleaching the ECN signal.
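As a rough illustration only (this is not Pietrasanta Traceroute’s actual implementation), a traceroute-style probe carrying a non-zero ECN codepoint can be emitted with ordinary sockets on Linux; reading the value back at each hop requires parsing the IP header quoted inside the ICMP Time Exceeded reply, which is omitted here:

```python
import socket

ECT_1 = 0b01  # non-zero ECN codepoint to place on outgoing probes

def send_probe(dst: str, port: int, ttl: int) -> None:
    """Send one UDP probe marked ECT(1) that expires at hop `ttl` (Linux sketch)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, ECT_1)   # set ECN bits in the ToS byte
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)     # force expiry at the Nth hop
    s.sendto(b"ecn-probe", (dst, port))
    s.close()

# Whether the ECN bits survived to hop N is then read from the IP header quoted
# in the resulting ICMP Time Exceeded message (raw-socket capture not shown).
```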
ECN is an essential requirement for L4S since L4S uses an ECN mechanism to provide early warning of congestion at the bottleneck link by marking a Congestion Experienced (CE) codepoint in the IP header of packets. After receipt of the packets, the receiver echoes the congestion information to the sender via acknowledgement (ACK) packets of the transport protocol. The sender can use the congestion feedback provided by the ECN mechanism to reduce its sending rate and avoid delay at the detected bottleneck.
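To illustrate the sender side, here is a simplified, DCTCP-style sketch of how a sender might react to the fraction of CE-marked acknowledgements; real L4S congestion controllers are more sophisticated, and the parameter names here are ours:

```python
def update_rate(current_rate_bps: float, acked: int, ce_marked: int,
                alpha: float, g: float = 1.0 / 16) -> tuple[float, float]:
    """One round trip of DCTCP-style reaction to ECN feedback (simplified).

    acked     -- packets acknowledged this RTT
    ce_marked -- of those, how many echoed a CE mark from the receiver
    alpha     -- running estimate of the marked fraction
    g         -- EWMA gain for smoothing the congestion signal
    """
    frac = ce_marked / acked if acked else 0.0
    alpha = (1 - g) * alpha + g * frac             # smooth the congestion signal
    new_rate = current_rate_bps * (1 - alpha / 2)  # back off in proportion to marking
    return new_rate, alpha
```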
ECN and L4S need to be supported not only by the client and server but also by every device along the network path. It takes only one instance of bleaching to remove the benefit of ECN: if any network device between the source and the endpoint clears the ECN bits, the sender and receiver never learn of the impending congestion. Our measurements examine how often ECN bleaching occurs and where in the network it happens.
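Conceptually, locating bleaching then comes down to comparing the ECN value sent with the value observed at each hop; a toy sketch with hypothetical data, assuming the per-hop observations have already been parsed out of the traceroute output:

```python
def first_bleaching_hop(sent_ecn: int, hops: list[tuple[int, str, int]]):
    """Return the first (hop_number, router_ip) where the observed ECN value no
    longer matches what was sent, i.e. where bleaching appears to start
    (strictly speaking, somewhere between the previous hop and this one).
    The record shape is hypothetical."""
    for hop_number, router_ip, observed_ecn in hops:
        if observed_ecn != sent_ecn:
            return hop_number, router_ip
    return None

# Probes sent with ECT(1) = 1; the value observed at hop 5 is 0:
path = [(1, "192.0.2.1", 1), (2, "192.0.2.254", 1),
        (5, "198.51.100.7", 0), (8, "203.0.113.9", 0)]
print(first_bleaching_hop(1, path))   # -> (5, '198.51.100.7')
```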
ECN has been around for a while, but with the growth in data volumes and rising expectations for user experience, particularly for streaming, ECN is now vital for L4S to succeed, and large technology companies worldwide are making major investments in it.
L4S aims to reduce packet loss (and hence the latency caused by retransmissions) and to make services as responsive as possible. In addition, we have seen significant momentum from major companies lately, which always helps push a new protocol toward deployment.
If ECN bleaching is found, any methodology built on top of ECN to detect congestion will not work. You can no longer rely on the network to achieve what ECN promises, namely avoiding congestion before it occurs: incipient congestion is signaled by marking the Congestion Experienced (CE = 3) codepoint when it is detected, and bleaching wipes out that information.
The causes of ECN bleaching are numerous and hard to identify, ranging from network equipment bugs to questionable traffic engineering choices, packet manipulations, and human error.
For example, bleaching can result from mistakes such as overwriting the whole ToS field when manipulating DSCP instead of changing only the DSCP bits (remember that DSCP and ECN together make up the ToS field in the IP header).
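A small sketch of that specific mistake, with illustrative helper names: the ToS byte carries DSCP in its upper six bits and ECN in its lower two, so a naive full-byte write wipes the ECN codepoint, while a masked write preserves it:

```python
# ToS byte layout: DSCP (6 bits) | ECN (2 bits)
DSCP_EF = 0x2E  # Expedited Forwarding, as a 6-bit DSCP value
ECT_1   = 0b01

def set_dscp_wrong(tos: int, dscp: int) -> int:
    """Buggy: rewrites the whole ToS byte, silently zeroing the ECN bits (bleaching)."""
    return dscp << 2

def set_dscp_right(tos: int, dscp: int) -> int:
    """Correct: rewrites only the 6 DSCP bits and preserves the 2 ECN bits."""
    return (dscp << 2) | (tos & 0b11)

tos = (DSCP_EF << 2) | ECT_1                 # packet marked EF with ECT(1)
print(bin(set_dscp_wrong(tos, DSCP_EF)))     # 0b10111000 -> ECN bits lost
print(bin(set_dscp_right(tos, DSCP_EF)))     # 0b10111001 -> ECN bits preserved
```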
Nowadays, network operators have a good number of tools to debug ECN bleaching from their end (such as those listed here), including Catchpoint’s Pietrasanta Traceroute. The large-scale measurement campaign presented here is one example of a worldwide effort to validate ECN readiness. Individual network operators can run similar measurement campaigns across the networks that matter to them (for example, customer or peering networks).
The findings presented here are based on tests run with Catchpoint’s enhanced traceroute, Pietrasanta Traceroute, through the Catchpoint IPM portal, collecting data from over 500 nodes located in more than 80 countries around the world. By running traceroutes on Catchpoint’s global node network, we are able to determine which ISPs, countries, and/or specific cities have issues passing ECN-marked traffic. The results show a global view of ECN bleaching from Catchpoint’s unique, though necessarily partial, perspective. To our knowledge, this is one of the first measurement campaigns of its kind.
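Once per-traceroute results are collected, summarizing them by ISP or country is straightforward; a minimal sketch with a hypothetical, simplified record schema (the real campaign data is richer):

```python
from collections import defaultdict

# Hypothetical, simplified result records for illustration only.
results = [
    {"country": "US", "isp": "ExampleNet", "bleached": True},
    {"country": "US", "isp": "ExampleNet", "bleached": False},
    {"country": "DE", "isp": "SampleTelecom", "bleached": False},
]

def bleaching_rate_by(key: str, records: list[dict]) -> dict[str, float]:
    """Share of traceroutes that observed ECN bleaching, grouped by e.g. country or ISP."""
    totals, bleached = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        bleached[r[key]] += r["bleached"]
    return {k: bleached[k] / totals[k] for k in totals}

print(bleaching_rate_by("country", results))  # {'US': 0.5, 'DE': 0.0}
```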
Beyond the scope of this campaign, Pietrasanta Traceroute can also be used to determine whether there is incipient congestion and/or any other kind of packet alteration, as well as the level of support for more accurate ECN feedback, including whether the destination transport layer (either TCP or QUIC) supports it.