The $1 Million Lesson: Building a Culture of Quality Through SLAs

Published March 7, 2025

In the early days of DoubleClick, back when SaaS was still known as Application Service Provider (ASP), I was tasked with setting up the QoS (Quality of Service) Team. Our primary mission was to establish a monitoring system, but we quickly found ourselves managing Service Level Agreements (SLAs)—a task that became critical after we paid out over $1 million in penalties for SLA violations to a single customer. The reason? Someone had signed a contract promising 100% uptime, an impossible commitment.  

This is the story of how we took control of our SLAs, stopped the financial bleeding, and built a culture of quality around service metrics. Whether you’re managing SLAs today or just curious about how they work, this post will provide valuable insights into the challenges we faced, the solutions we implemented, and the lessons we learned along the way.

What are SLAs?

An SLA (Service Level Agreement) is a contractual agreement between a vendor and a customer that outlines the expected level of service. Under this legal umbrella, you’ll find Service Level Objectives (SLOs), which define specific metrics like uptime, speed, or transactions per second.

At DoubleClick, we defined SLAs with the following principles in mind:

  • Attainable: The goals should be realistic.
  • Repeatable: The metrics should be consistently measurable.
  • Measurable: The performance should be quantifiable.
  • Meaningful: The metrics should matter to the business.
  • Mutually Acceptable: Both parties should agree on the terms.

SLAs benefit both the customer and the vendor. For customers, they provide objective grading criteria and protection from poor service. For vendors, they set clear expectations and incentivize quality improvements.

Ground zero: discovery

When we first tackled the SLA problem, we were in crisis mode. The first step was to compile a list of all contracts, extract the SLAs and SLOs, and document the associated penalties. We stored this information in a database and began educating stakeholders—business leaders, legal teams, and executives—about the importance of SLAs.

From the beginning, we focused on end-user experience-based SLAs. This meant measuring performance from the user’s perspective, not just from the server’s perspective.

A universal challenge

Over the years, I’ve seen many companies face similar issues. Not all SRE and Dev teams fully grasp the SLAs their organization has with customers—they often focus heavily on internal SLOs while overlooking how those metrics tie directly to contractual commitments. For instance, after facing significant penalties, companies like Slack revised their SLA terms to better align internal goals with customer promises.

SLA Application Performance

Establishing an SLA is more than just putting a few sentences in a contract. The reason we paid $1 million is that there was no SLA management system in place. We responded by building a Service Level Management (SLM) practice that rested on four pillars: Administration, Monitoring, Reporting, and Compliance (AMRC).

The SLM process

We sat down with business partners, customers, legal, and finance teams to create a process that would prevent costly mistakes in the future. This process, which we called the SLA lifecycle, was reviewed quarterly to ensure it remained effective and aligned with our business goals.

  1. Risk simulations with data science: One of the most critical steps in our SLM process was using our in-house data scientists to run simulations. These simulations analyzed historical data from our monitoring tools to assess the risk of breaching SLAs. The goal was to set realistic SLAs that wouldn’t be breached every day, while still meeting customer expectations (see the sketch after this list).
  2. “What-if” scenarios: We also ran multiple “what-if” scenarios to understand the relationship between availability and revenue. These scenarios helped us evaluate the impact of downtime at different hours of the day and days of the week. For example, we could see how a 10-minute outage during peak traffic hours would affect revenue compared to the same outage during off-peak times.
  3. The SLA desk: To streamline the process, we created an online tool in 2001—essentially an “SLA desk”—that allowed our sales team to request SLA portfolios for customers. These requests were reviewed and approved by our QoS team, ensuring that every SLA was realistic, measurable, and aligned with our capabilities.
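
To make the idea concrete, here is a minimal sketch of this kind of breach-risk simulation. It is not DoubleClick’s actual model: the bootstrap-with-jitter approach, the downtime figures, and the SLO targets are all illustrative assumptions.

```python
import random

def breach_probability(monthly_downtime_samples, slo_uptime=0.999,
                       n_trials=10_000, minutes_per_month=30 * 24 * 60):
    """Estimate the probability of breaching a monthly uptime SLO by
    resampling historical downtime minutes (a simple bootstrap)."""
    budget = (1 - slo_uptime) * minutes_per_month  # allowed downtime, in minutes
    breaches = 0
    for _ in range(n_trials):
        simulated = random.choice(monthly_downtime_samples)  # draw one historical month
        simulated *= random.uniform(0.8, 1.2)  # jitter to model uncertainty
        if simulated > budget:
            breaches += 1
    return breaches / n_trials

# Hypothetical downtime minutes observed over the last 12 months
history = [12, 45, 3, 0, 88, 20, 5, 31, 9, 14, 60, 2]
for target in (0.999, 0.9995, 0.9999):
    print(f"SLO {target:.4%}: breach risk ~ {breach_probability(history, target):.1%}")
```

Running candidate SLO targets through a simulation like this is what lets you refuse a 100% uptime clause with data rather than opinion: the breach risk for an unattainable target shows up immediately as a near-certainty.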

Aligning external and internal SLAs

One of the biggest challenges we faced was the mismatch between external SLAs (what we promised customers) and internal SLAs (what we measured internally). For example, customers would ask for ad-serving uptime, while our tech team measured server availability.

To solve this, we aligned our external and internal SLOs and set the internal targets significantly stricter than the contractual ones. This was a huge victory because it allowed us to rely on one set of metrics to understand our SLA risk position and drive operational excellence. Our tech group (Ops, Engineering, etc.) also became more sensitive to the notion of a business SLA and cared deeply about not breaching one.
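
As a rough illustration of that alignment, the sketch below grades a measured uptime number against both thresholds. The 99.9%/99.95% figures are hypothetical, not our actual contract terms; the point is that the internal target trips first, giving the team room to act before the contract is at risk.

```python
# Hypothetical thresholds: the external SLA is what the contract promises;
# the internal SLO is deliberately stricter so trouble is caught early.
EXTERNAL_SLA_UPTIME = 0.999    # contractual commitment
INTERNAL_SLO_UPTIME = 0.9995   # tighter internal target

def classify(measured_uptime: float) -> str:
    """Grade a measured uptime against both thresholds."""
    if measured_uptime < EXTERNAL_SLA_UPTIME:
        return "SLA breach: contractual penalty territory"
    if measured_uptime < INTERNAL_SLO_UPTIME:
        return "internal SLO miss: early warning, no penalty yet"
    return "healthy"

print(classify(0.9998))  # healthy
print(classify(0.9992))  # internal miss -> act before the contract is at risk
print(classify(0.9985))  # SLA breach
```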

Monitoring – the key to SLA success

For availability and performance, we relied on three synthetic products. Internally, we ran SiteScope in 17 data centers, and we used two external synthetic monitoring products. We wanted as many data points as possible from as many tools as possible; the stakes were simply too high not to invest in multiple tools. The entire SLM program was not cheap to implement and run each year, but I had learned the hard way what it cost not to do it right.

For monitoring, it became clear we needed to test as often as possible from as many vantage points as possible:

  • If you check your SLO endpoints only once an hour, 59 minutes pass between a failure and the next check that could confirm or clear it. That gap can produce false downtime alerts and let short outages go undetected.
  • You also need many data points to ensure statistical significance. Smaller datasets lower precision and power, while larger ones help you manage false positives and false negatives (illustrated in the sketch below).
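
The sketch below illustrates the second point with a normal-approximation confidence interval. It treats each synthetic check as an independent pass/fail sample, which is a simplifying assumption, but it shows why hourly checks cannot defensibly distinguish 99.9% from 99.7% while per-minute checks can.

```python
import math

def availability_margin(checks_per_day: int, days: int = 30,
                        availability: float = 0.999, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a measured availability figure,
    treating each synthetic check as an independent pass/fail sample."""
    n = checks_per_day * days
    return z * math.sqrt(availability * (1 - availability) / n)

# More checks -> a tighter interval around the measured number.
for freq in (24, 288, 1440):  # hourly, every 5 minutes, every minute
    m = availability_margin(freq)
    print(f"{freq:>5} checks/day -> 99.9% measured +/- {m:.4%}")
```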

Enter Differential Performance Measurement (DPM)

One of our biggest challenges was finding an effective way to measure ad delivery speed and capture it in our SLAs. Clients would see spikes in their site performance and attribute them to our system, while our own performance telemetry showed no problems. Because we couldn’t correlate the two charts, we couldn’t agree on whether the problem was ours or someone else’s.

To address this, we developed a methodology called Differential Performance Measurement (DPM). Our goal was to measure DoubleClick’s performance and availability with precision, and to understand how it affected our customers’ pages. We also wanted to be accountable for what we controlled, so we could avoid blame and finger-pointing.

The methodology added context to the measurements. DPM introduced clarity and comparison, removing absolute performance numbers from the SLAs.

Recipe for Differential Performance Measurement (example with an ad):

  1. Take two pages—one without ads and one with a single ad call.
  • Page A = No ads
  • Page B = One ad
  2. Make sure the pages do not contain any other third-party references (CDNs, etc.).
  3. Make sure the page sizes (in KB) are the same.
  4. “Bake” – Measure response times for both pages to produce the following metrics (see the sketch below):
  • Differential Response (DR) = (Response Time of Page B) − (Response Time of Page A)
  • Differential Response Percentage (DRP) = DR / (Response Time of Page A). For example, if Page A loads in 2 seconds and Page B in 2.1 seconds, DR is 0.1 seconds and DRP is 0.1/2 = 0.05, or 5%.
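
Here is the recipe’s arithmetic as a minimal sketch. The function name is mine, and the 2.0s/2.1s inputs simply restate the worked example above.

```python
def differential_response(page_a_seconds: float, page_b_seconds: float):
    """Compute DR and DRP as defined in the recipe above.
    Page A has no ads; Page B is identical but makes one ad call."""
    dr = page_b_seconds - page_a_seconds  # Differential Response
    drp = dr / page_a_seconds             # Differential Response Percentage
    return dr, drp

# Example from the text: Page A loads in 2.0s, Page B in 2.1s
dr, drp = differential_response(2.0, 2.1)
print(f"DR = {dr:.2f}s, DRP = {drp:.1%}")  # DR = 0.10s, DRP = 5.0%
```

Because the SLA is written against DR and DRP rather than absolute load times, a slow day on the customer’s own infrastructure moves both pages equally and leaves the differential untouched.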

This approach helped eliminate noise caused by:

  1. Internet-related issues beyond our control (e.g., fiber cuts).
  2. Monitoring agent inconsistencies (raising the need to monitor our monitoring tools).
  3. Other third-party dependencies.

To visualize the impact of Differential Performance Measurement (DPM), the chart below compares response times for two scenarios:

[Chart: response-time comparison for Scenario 1 and Scenario 2]

Scenario 1: The ad-serving company experienced performance issues, which negatively impacted the customer’s site. The vendor breached the SLA threshold between Time 4 and Time 8.

Scenario 2: The website itself encountered performance problems, unrelated to the ad-serving company.

Reporting – Transparency and Accountability

After the $1 million penalty, SLA management became a top priority, with visibility extending all the way to the CEO. We reported monthly on compliance and breaches, using tools like DigitalFuel to detect issues in real time.

By the end of 2001, we were tracking over 100 Operational Level Agreements (OLAs), and a Culture of Quality had emerged at DoubleClick. Everyone—from engineers to executives—was aligned around business service metrics, and no one wanted to breach an SLA.

Lessons learned and the road ahead

Implementing a comprehensive SLM process at DoubleClick allowed us to:

  • Manage hundreds of contracts with up to five SLOs each.
  • Offer scalable SLAs that could adapt to new products.
  • Reduce financial risks by avoiding costly penalties.
  • Maintain our reputation by providing accurate and meaningful SLAs.
  • Detect breaches in real-time, allowing us to take proactive measures.

One of the biggest advantages was knowing in advance when an SLA was at risk. For example, we could predict that adding four minutes of downtime would breach 12 contracts and result in $X in penalties. This insight helped our Ops team act—pausing releases or preventing any changes that could impact uptime.
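
As an illustration of that kind of prediction, the sketch below checks which contracts an extra outage would push over their monthly downtime budgets and totals the exposure. The contract records, budgets, and penalty amounts are all hypothetical.

```python
# Hypothetical contract records: each has a monthly downtime budget (minutes)
# derived from its SLA, the downtime consumed so far, and a penalty amount.
contracts = [
    {"customer": "A", "budget_min": 43.2, "used_min": 41.0, "penalty": 50_000},
    {"customer": "B", "budget_min": 21.6, "used_min": 19.5, "penalty": 120_000},
    {"customer": "C", "budget_min": 43.2, "used_min": 10.0, "penalty": 30_000},
]

def impact_of_downtime(extra_minutes: float):
    """Which contracts would additional downtime push into breach,
    and what would it cost in penalties?"""
    at_risk = [c for c in contracts
               if c["used_min"] + extra_minutes > c["budget_min"]]
    return at_risk, sum(c["penalty"] for c in at_risk)

breached, cost = impact_of_downtime(4.0)
print(f"4 more minutes of downtime breaches {len(breached)} contracts "
      f"(${cost:,} in penalties)")
```

With this view in hand, “should we ship this risky change today?” becomes a dollar figure rather than a debate.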

Some people dismiss SLAs, and in many cases, that skepticism is justified. Bad SLAs—those with unrealistic guarantees, no real penalties, or vague measurement criteria—undermine trust. I often see SLAs promising 0% packet loss, but when you ask how it’s measured, you quickly realize it’s meaningless. These kinds of SLAs give the entire concept a bad reputation.

However, when done right, SLAs are essential. They align customers and vendors, reduce friction, and eliminate blame games. That said, customers need to demand useful SLAs—not just ones that sound good on paper. The goal isn’t to drive vendors out of business but to hold them accountable. If they fail to deliver, they should feel the impact.

The evolution of SLAs

Back in 2001, we knew SLA management was critical, but could we have predicted how integral it would become in today’s cloud-driven world? SLAs have evolved from simple uptime guarantees to complex agreements that cover everything from latency to data residency. XLOs (Experience Level Objectives) are a thing—metrics that focus on the customer’s experience, not just the server’s performance. This shift in focus—from internal metrics to customer outcomes—is the future of performance management.

Stay tuned for Part 2, where we’ll explore how businesses can align their internal metrics with what truly matters: the customer’s experience.

Learn more

New to SLAs, SLOs, and SLIs? Read this post to learn the fundamentals, best practices, and how they impact service reliability.
