Choosing the right IPM platform: A buyer’s guide for IT leaders

In today’s always-on digital world, the availability, performance, and resilience of your Internet-based services, such as websites, applications, APIs, and mobile apps, can make or break business success. As a CIO or IT leader overseeing Site Reliability Engineers (SREs), Network Engineers, and Development Teams, you know that ensuring these services operate seamlessly is paramount.

This buyer’s guide simplifies the process of choosing the right Internet Performance Monitoring (IPM) platform by:  

  • Outlining the components of the new Internet Stack every company needs to monitor.
  • Highlighting the limitations of traditional monitoring solutions.
  • Detailing what to look for when purchasing an IPM platform.  
  • Offering a checklist of essential requirements.

The complexity of the Internet Stack

Internet-based applications rely on a complex stack of technologies, systems and services referred to as the Internet Stack. It includes multiple layers, from the physical network infrastructure to application-level protocols and services. Each layer is interdependent, so when something goes wrong, it can ripple across the entire system, affecting performance and availability.  

The Internet Stack comprises these critical layers:  

  • Application - The topmost layer of the Internet Stack, where users connect with digital services like shopping, reservations and collaboration tools.
  • Media and Advertising - Powers analytics, ad delivery, and customer experiences through content and advertising.
  • Cloud Services - Provides hosting, APIs, and recovery solutions essential for modern computing.
  • Internet Core - The backbone of the Internet, including DNS, CDNs, BGP, and edge computing.
  • Protocol - The rulebook for Internet communication, allowing devices to talk to each other smoothly.
  • Network - The vast web of connections that forms the Internet. It includes everything from your local Wi-Fi to global infrastructure.

The complexity of the Internet Stack necessitates a comprehensive monitoring strategy that provides visibility into each layer and real-world application experience from the user's perspective. This approach helps you identify potential performance issues early, enabling swift resolution.

Traditional Application Performance Monitoring (APM) solutions fall short in this regard. Here’s why APM is no longer enough for today’s complex Internet Stack.

APM challenges and limitations

APM emerged roughly 30 years ago for monolithic, three-tier applications running on local, on-premises infrastructure without virtualization, where infrastructure lacked elasticity. While APM provides good insight into application logs, traces, and infrastructure metrics, it faces several limitations and challenges.

  • No end user perspective - APM cannot proactively test from the user’s location, leading to UX blind spots, as performance can vary significantly depending on location, network provider, and device type.
  • Limited SaaS & PaaS visibility - APM provides limited or no data on SaaS and PaaS applications, resulting in missed insights into potential infrastructure or integration bottlenecks and making you reactive to issues beyond your control.
  • No external visibility - APM gives only a limited view of the Internet Stack and third-party dependencies, obscuring external factors like Internet outages, overloaded CDNs, or API slowdowns and leading to finger-pointing and delayed troubleshooting.
  • No global performance visibility - APM has no visibility into global or regional performance, which hinders understanding of how users in different locations experience your application, potentially causing dissatisfaction among customers in specific regions.
  • High alert thresholds - APM alerts only trigger when a threshold percentage of errors is reached. This approach might cause you to miss subtle but consistent performance issues, leading to gradual user frustration before the threshold is met.
  • No real-world resilience - APM can’t measure real-world resilience. This might leave you unprepared for unexpected traffic surges or infrastructure failures, leading to downtime and revenue loss.

Due to these limitations, APM tools are often insufficient for ensuring the performance and availability of internet-based services. This gap is where IPM platforms come into play.

Key features to look for in an Internet Performance Monitoring Platform

To ensure a robust and reliable IPM platform, prioritize the following features:  

#1 - End-user experience monitoring

Many organizations default to monitoring only the systems within their direct control. However, this approach overlooks a crucial aspect: it doesn’t reflect the user experience. An effective IPM platform should provide insights into real user experiences, helping IT teams identify and address performance bottlenecks.  

A strong IPM platform should offer:

  • Real User Monitoring (RUM) - Collecting data from actual users to provide insights into how they experience the application.
  • Internet Synthetics - Traditional synthetic monitoring limits you to running web and API tests from cloud locations, whereas Internet Synthetics emulates actual end-user behavior from where your users really are. This approach helps you identify performance issues before they impact real users.
  • Transaction monitoring - Tracks specific user transactions to ensure they perform as expected.

A global network of intelligent agents

Monitoring digital experiences from users' perspective requires a global network of intelligent agents. Having these agents in the right locations can shed light on crucial questions during an outage or performance degradation:  

  • Is the issue localized to a specific geography, or is its impact global?
  • Are only users from certain Internet Service Providers (ISPs) affected?
  • What is the quality of the user experience over mobile networks?
  • Should we solely depend on cloud agent data, considering users aren’t restricted to cloud environments?  

Ensure the platform offers:

  • Wide coverage - A global network of intelligent agents, including coverage in key regions where your user base is concentrated.
  • Reliable network connectivity - Intelligent agents should be on single-homed connections, which are better for synthetic monitoring as they provide consistent and precise baseline measurements by routing all traffic through a single ISP, avoiding the noise and unpredictability of multi-homed nodes.
  • Edge visibility - IPM platforms must detect and manage regional outages or micro-outages at the edge, requiring more advanced monitoring than synthetic monitoring from centralized locations like AWS.
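
To make synthetic testing from distributed agents concrete, here is a minimal sketch in Python of the kind of check an agent might run. The URL and timeout are placeholders, and a real IPM platform’s synthetic tests go much further (full browser sessions, scripted transactions, and scheduled runs from many global vantage points).

    # Minimal sketch of a synthetic HTTP check an agent might run.
    # The URL and timeout are placeholders, not part of any specific platform.
    import time
    import requests

    def run_check(url: str, timeout: float = 10.0) -> dict:
        """Fetch a URL once and record basic availability and timing data."""
        started = time.perf_counter()
        try:
            response = requests.get(url, timeout=timeout)
            elapsed_ms = (time.perf_counter() - started) * 1000
            return {
                "url": url,
                "ok": response.status_code < 400,
                "status": response.status_code,
                "response_time_ms": round(elapsed_ms, 1),
            }
        except requests.RequestException as exc:
            return {"url": url, "ok": False, "error": str(exc)}

    if __name__ == "__main__":
        # A real agent would run this on a schedule from many locations
        # and report the results back to the IPM platform.
        print(run_check("https://www.example.com/"))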

#2 - Proactive issue detection and resolution

An effective IPM platform identifies and resolves issues before they impact users.

Key features include:

  • Alerting - Customizable alerting capabilities to notify IT teams of performance issues based on predefined thresholds and conditions.
  • Root cause analysis - AI-powered tools to help diagnose the root cause of performance issues, including network path analysis and dependency mapping.
  • Real-time dependency mapping - Visualize service components with live status updates.
  • Automated remediation - Workflows to quickly resolve common performance issues, reducing the time to resolution.
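
As a simplified illustration of the alerting capability above, the sketch below evaluates response-time samples against a static threshold and only fires after several consecutive breaches, which avoids alert flapping. The threshold, breach count, and sample data are invented for illustration; real platforms support far richer conditions (baselines, anomaly detection, multi-location confirmation).

    # Toy illustration of threshold-based alerting: fire an alert only after
    # several consecutive samples breach the threshold, to avoid flapping.
    # The threshold and sample data are invented for illustration.
    from typing import Iterable, List

    def evaluate_alerts(samples_ms: Iterable[float],
                        threshold_ms: float = 1500.0,
                        consecutive_breaches: int = 3) -> List[str]:
        alerts = []
        streak = 0
        for i, sample in enumerate(samples_ms):
            if sample > threshold_ms:
                streak += 1
                if streak == consecutive_breaches:
                    alerts.append(
                        f"sample #{i}: {consecutive_breaches} consecutive responses "
                        f"above {threshold_ms:.0f} ms (latest {sample:.0f} ms)"
                    )
            else:
                streak = 0
        return alerts

    if __name__ == "__main__":
        # Simulated response times in milliseconds.
        print(evaluate_alerts([420, 380, 1700, 1820, 1910, 600]))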

#3 - Network path analysis

Understanding how data travels across the Internet is key to spotting bottlenecks and disruptions.  

An effective IPM platform should provide detailed visibility into the network path, including:

  • Traceroute analysis - Visualizes the path data takes from the monitoring location to the destination, identifying potential delays or failures.
  • Routing analysis - Tracks BGP routing changes to detect hijacks and reachability issues, ensuring stable and secure routing paths.
  • ISP performance monitoring - Evaluates ISP performance to identify and resolve connectivity issues.
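
To illustrate what traceroute-style path analysis does under the hood, here is a rough Python sketch using the Scapy library. The target host and hop limit are placeholders, the timings are coarse approximations measured in user space, and purpose-built tools (including Catchpoint’s Pietrasanta Traceroute) do far more, such as multi-protocol probing and per-hop statistics.

    # Minimal hop-by-hop probe sketch using Scapy (requires root privileges).
    # The target host and hop limit are placeholders for illustration.
    import time
    from scapy.all import IP, ICMP, sr1  # pip install scapy

    def simple_traceroute(target: str, max_hops: int = 20) -> None:
        for ttl in range(1, max_hops + 1):
            probe = IP(dst=target, ttl=ttl) / ICMP()
            started = time.perf_counter()
            reply = sr1(probe, timeout=2, verbose=0)
            rtt_ms = (time.perf_counter() - started) * 1000
            if reply is None:
                print(f"{ttl:2d}  *  (no reply)")
            elif reply.type == 11:  # ICMP Time Exceeded from an intermediate router
                print(f"{ttl:2d}  {reply.src}  ~{rtt_ms:.1f} ms")
            elif reply.type == 0:   # ICMP Echo Reply: destination reached
                print(f"{ttl:2d}  {reply.src}  ~{rtt_ms:.1f} ms  (destination)")
                break

    if __name__ == "__main__":
        simple_traceroute("example.com")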

#4 - Comprehensive metric collection

A comprehensive IPM platform should collect a wide range of performance metrics across all layers of the Internet Stack, offering a complete view of service performance.  

Key metrics to look for:

  • Network performance - Latency, packet loss, jitter, and throughput.
  • Application performance - Page load times, response times, and error rates.
  • DNS performance - DNS resolution times, including lookup times and propagation delays.
  • CDN performance - Delivery speeds and resilience of content delivery networks.
  • Third-party services - Performance of APIs and external services.
  • BGP monitoring - Detects routing issues like hijacks or reachability problems.
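
As a simple illustration of how raw probe results become the network metrics listed above, the sketch below derives average latency, jitter, and packet loss from a batch of round-trip-time samples. The sample values are invented, and the jitter calculation (mean difference between consecutive RTTs) is only one of several common definitions.

    # Sketch of deriving basic network metrics from a batch of probe results.
    # The sample data is invented; None represents a probe that got no reply.
    from __future__ import annotations
    import statistics

    def summarize_probes(rtts_ms: list[float | None]) -> dict:
        replies = [r for r in rtts_ms if r is not None]
        sent = len(rtts_ms)
        loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
        # Jitter here is the mean absolute difference between consecutive RTTs
        # (RFC 3550 uses a smoothed variant of the same idea).
        diffs = [abs(b - a) for a, b in zip(replies, replies[1:])]
        return {
            "latency_avg_ms": round(statistics.fmean(replies), 2) if replies else None,
            "jitter_ms": round(statistics.fmean(diffs), 2) if diffs else 0.0,
            "packet_loss_pct": round(loss_pct, 1),
        }

    if __name__ == "__main__":
        print(summarize_probes([21.4, 22.1, None, 25.0, 21.9]))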

#5 - Real-time and historical data

An effective IPM platform offers real-time monitoring to detect issues instantly and historical data for trend analysis and optimization.  

Look for:

  • Real-time insights - Continuous, real-time monitoring of performance metrics, enabling rapid issue detection.
  • Historical data analysis - Access to historical performance data for trend analysis, capacity planning, and post-incident analysis.
  • Data granularity - High-resolution data granularity to provide detailed insights into performance fluctuations.

#6 - Customizable dashboards and reporting

A good IPM platform helps IT teams visualize and share insights with stakeholders through:

  • Custom dashboards - Tailor views to highlight key metrics and trends.
  • Automated reporting - Schedule reports using predefined templates.
  • Data export - Export performance data in various formats for further analysis and integration with other tools.

#7 - Scalability and flexibility

As organizations grow and their digital footprint expands, an effective IPM platform must be scalable and flexible to meet evolving needs.  

Key features include:

  • Scalability - Handles increased monitoring as your organization expands.
  • Flexibility - Support for monitoring a wide range of applications, services, and environments, including on-premises, cloud, and hybrid environments.
  • APIs and integrations - Seamlessly connects with existing IT tools and infrastructure.

#8 - Robust SLO monitoring for services and vendors

Effective SLO monitoring is critical for maintaining high performance standards and accountability across internal teams and vendors.

An IPM platform should support:

  • Service reliability - Continuously measure and report on SLOs to ensure performance targets are met and issues are promptly addressed.
  • Long-term data retention - Retain performance data long enough to hold vendors accountable to their service level agreements (SLAs) with clear metrics and insights.
  • Operational excellence - Empower internal teams with actionable data to identify improvement areas and drive reliability.
  • Experience Level Objectives (XLOs) - Monitor user experience as an SLO, proactively addressing issues before they impact end users.  
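
To make the SLO items above concrete, here is a minimal sketch of an availability error-budget calculation. The 99.9% target and check counts are purely illustrative, and an XLO would apply the same arithmetic to a user-experience metric (for example, the share of page loads under a latency target) rather than simple up/down checks.

    # Toy error-budget calculation for an availability SLO.
    # The target and the check counts below are invented for illustration.
    def error_budget_report(total_checks: int, failed_checks: int,
                            slo_target: float = 0.999) -> dict:
        availability = 1 - failed_checks / total_checks
        allowed_failures = total_checks * (1 - slo_target)
        # Fraction of the error budget still unspent (negative means overspent).
        budget_remaining = 1 - failed_checks / allowed_failures if allowed_failures else 0.0
        return {
            "availability": f"{availability:.4%}",
            "slo_target": f"{slo_target:.1%}",
            "error_budget_remaining": f"{budget_remaining:.1%}",
            "slo_met": availability >= slo_target,
        }

    if __name__ == "__main__":
        # e.g. 43,200 one-minute checks in a 30-day month, 30 of which failed.
        print(error_budget_report(total_checks=43_200, failed_checks=30))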

APM tools often fall short, missing critical user-impacting issues and lacking the data needed for accurate SLO and XLO measurement. Synthetic monitoring from AWS may confirm a vendor’s service is working within AWS, but this narrow approach fails to provide the transparency and accountability required for today’s complex Internet environment.

Why leaders choose Catchpoint for Internet Performance Monitoring

Finding the right platform to ensure the availability, performance, and resilience of Internet-based services is no small task. With so many criteria to consider—spanning global coverage, real-time insights, and advanced analytics—it can be difficult to determine which solution truly meets your organization’s needs.  

Industry benchmarks, such as Gartner’s Magic Quadrant for Digital Experience Monitoring (DEM), can help guide your decision. Recognizing leaders and their focus areas is a great way to identify the most effective solutions. Catchpoint, named a Leader in Gartner’s first Magic Quadrant for DEM, exemplifies excellence in bridging this gap.

What is ECN?

Explicit Congestion Notification (ECN) is a longstanding mechanism in the IP stack that allows the network to help endpoints "foresee" congestion between them. The concept is straightforward: if a nearly congested piece of network equipment, such as a router along the path, could tell the endpoints, "Hey, I'm almost congested! Can you two slow down your data transmission? Otherwise, I’m worried I will start to lose packets...", then the two endpoints can react in time to avoid the packet loss, paying only the price of a minor slowdown.

What is ECN bleaching?

ECN bleaching occurs when a network device at any point between the source and the endpoint clears or “bleaches” the ECN flags. Since traffic must reach your content via transit providers or peering, it’s important to know whether bleaching is occurring and to remove any instances of it.

With Catchpoint’s Pietrasanta Traceroute, we can send probes with IP-ECN values different from zero to check hop by hop what the IP-ECN value of the probe was when it expired. We may be able to tell you, for instance, that a domain is capable of supporting ECN, but an ISP in between the client and server is bleaching the ECN signal.
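
The hop-by-hop idea can be approximated with a short Scapy sketch. This is not Pietrasanta Traceroute itself, just an illustration of the underlying technique: set the ECN bits on TTL-limited probes and read them back from the IP header that each router quotes inside its ICMP reply. The target host is a placeholder, and the script needs raw-socket privileges.

    # Rough sketch of checking for ECN bleaching along a path with Scapy.
    # Requires root; the target is a placeholder. Real tools such as
    # Pietrasanta Traceroute are far more thorough and robust.
    from scapy.all import IP, UDP, ICMP, IPerror, sr1  # pip install scapy

    ECT0 = 0b10  # ECN-Capable Transport (0) codepoint in the two ECN bits

    def check_ecn_path(target: str, max_hops: int = 20) -> None:
        for ttl in range(1, max_hops + 1):
            probe = IP(dst=target, ttl=ttl, tos=ECT0) / UDP(dport=33434 + ttl)
            reply = sr1(probe, timeout=2, verbose=0)
            if reply is None:
                print(f"{ttl:2d}  *")
                continue
            quoted = reply.getlayer(IPerror)  # our probe's IP header, echoed back
            if quoted is not None:
                ecn_bits = quoted.tos & 0b11
                status = "ECN preserved" if ecn_bits == ECT0 else f"bleached (ECN={ecn_bits})"
                print(f"{ttl:2d}  {reply.src}  {status}")
            if reply.haslayer(ICMP) and reply[ICMP].type == 3:
                break  # Destination Unreachable: the probe reached the target

    if __name__ == "__main__":
        check_ecn_path("example.com")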

Why is ECN important to L4S?

ECN is an essential requirement for L4S since L4S uses an ECN mechanism to provide early warning of congestion at the bottleneck link by marking a Congestion Experienced (CE) codepoint in the IP header of packets. After receipt of the packets, the receiver echoes the congestion information to the sender via acknowledgement (ACK) packets of the transport protocol. The sender can use the congestion feedback provided by the ECN mechanism to reduce its sending rate and avoid delay at the detected bottleneck.

ECN and L4S need to be supported not only by the client and server but also by every device within the network path. It only takes one instance of bleaching to remove the benefit of ECN: if any network device between the source and the endpoint clears the ECN bits, the sender and receiver won’t find out about the impending congestion. Our measurements examine how often ECN bleaching occurs and where in the network it happens.

Why are ECN and L4S in the news all of a sudden?

ECN has been around for a while, but with the increase in data volumes and the demand for a high-quality user experience, particularly for streaming, ECN is now vital for L4S to succeed, and major investments are being made by large technology companies worldwide.

L4S aims to reduce packet loss (and hence the latency caused by retransmissions) and to provide as responsive a set of services as possible. In addition, we have seen significant momentum from major companies lately, which always helps push a new protocol toward deployment.

What is the impact of ECN bleaching?

If ECN bleaching is found, this means that any methodology built on top of ECN to detect congestion will not work.

Thus, you cannot rely on the network to achieve what you want to achieve, i.e., avoiding congestion before it occurs: potential congestion is marked with the Congestion Experienced (CE = 3) codepoint when detected, and bleaching wipes out that information.

What are the causes behind ECN bleaching?

The causes behind ECN bleaching are multiple and hard to identify, ranging from network equipment bugs to debatable traffic engineering choices, packet manipulations, and human error.

For example, bleaching could result from mistakes such as overwriting the whole ToS field when dealing with DSCP instead of changing only the DSCP bits (remember that DSCP and ECN together compose the ToS field in the IP header).
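
Because this is a bit-level mistake, a tiny sketch makes it concrete: the former ToS byte carries DSCP in its upper six bits and ECN in its lower two, so rewriting the whole byte instead of masking in the new DSCP value silently zeroes the ECN codepoint. The DSCP values below are arbitrary examples.

    # Illustration of how the ToS byte splits into DSCP (upper 6 bits) and
    # ECN (lower 2 bits), and how a careless DSCP rewrite bleaches ECN.
    ECN_MASK = 0b0000_0011
    DSCP_SHIFT = 2

    def set_dscp_carelessly(tos: int, new_dscp: int) -> int:
        """Overwrites the whole ToS byte: the ECN bits are lost (bleached)."""
        return new_dscp << DSCP_SHIFT

    def set_dscp_correctly(tos: int, new_dscp: int) -> int:
        """Rewrites only the DSCP bits and preserves the ECN bits."""
        return (new_dscp << DSCP_SHIFT) | (tos & ECN_MASK)

    if __name__ == "__main__":
        tos = (46 << DSCP_SHIFT) | 0b01          # DSCP EF (46) with ECT(1) set
        print(bin(set_dscp_carelessly(tos, 0)))  # 0b0 -> ECN bits wiped out
        print(bin(set_dscp_correctly(tos, 0)))   # 0b1 -> ECN bits preserved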

How can you debug ECN bleaching?

Nowadays, network operators have a good number of tools to debug ECN bleaching from their end (such as those listed here) – including Catchpoint’s Pietrasanta Traceroute. The large-scale measurement campaign presented here is an example of a worldwide campaign to validate ECN readiness. Individual network operators can run similar measurement campaigns across networks that are important to them (for example, customer or peering networks).

What is the testing methodology?

The findings presented here are based on running tests using Catchpoint’s enhanced traceroute, Pietrasanta Traceroute, through the Catchpoint IPM portal to collect data from over 500 nodes located in more than 80 countries all over the world. By running traceroutes on Catchpoint’s global node network, we are able to determine which ISPs, countries and/or specific cities are having issues when passing ECN marked traffic. The results demonstrate the view of ECN bleaching globally from Catchpoint’s unique, partial perspective. To our knowledge, this is one of the first measurement campaigns of its kind.

Beyond the scope of this campaign, Pietrasanta Traceroute can also be used to determine if there is incipient congestion and/or any other kind of alteration and the level of support for more accurate ECN feedback, including if the destination transport layer (either TCP or QUIC) supports more accurate ECN feedback.

The content of this page is Copyright 2024 by Catchpoint. Redistribution of this data must retain the above notice (i.e. Catchpoint copyrighted or similar language), and the following disclaimer.

THE DATA ABOVE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS OR INTELLECTUAL PROPERTY RIGHT OWNERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THIS DATA OR THE USE OR OTHER DEALINGS IN CONNECTION WITH THIS DATA.

We are happy to discuss or explain the results if more information is required. Further details per region can be released upon request.