Why We Built Catchpoint Orchestra

Orchestra is Catchpoint’s proprietary NoSQL database and the foundation of our platform. It is a time series database; however, we use it not only for constructing time series, but also for analyzing and aggregating synthetic test results and real user data. It does not merely provide averages and counts (nor medians of averages or averages of medians, which are unreliable indicators), but real statistics and percentiles at a highly granular level.
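
To see why medians of averages mislead, here is a minimal sketch in C# (the language Orchestra is built in); the numbers are invented for illustration. Averaging per-bucket medians hides a heavy tail that a true percentile over the raw samples exposes:

```csharp
// A minimal sketch, with invented numbers: a true percentile over raw
// samples vs. the median-of-medians shortcut the post warns against.
using System;
using System.Linq;

class PercentileDemo
{
    // p-th percentile of the raw samples, nearest-rank method.
    static double Percentile(double[] samples, double p)
    {
        var sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length) - 1;
        return sorted[Math.Max(rank, 0)];
    }

    static void Main()
    {
        // Two time buckets of response times (ms); bucket 2 has a heavy tail.
        double[] bucket1 = { 100, 110, 120, 130, 140 };
        double[] bucket2 = { 100, 110, 120, 900, 950 };

        // Aggregating per-bucket medians hides the tail entirely...
        double medianOfMedians =
            (Percentile(bucket1, 50) + Percentile(bucket2, 50)) / 2;

        // ...while the true 95th percentile over the raw data exposes it.
        double trueP95 = Percentile(bucket1.Concat(bucket2).ToArray(), 95);

        Console.WriteLine($"median of medians: {medianOfMedians} ms"); // 120
        Console.WriteLine($"true p95:          {trueP95} ms");         // 950
    }
}
```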

Orchestra supports both real time and non-real time (batch) data. The batch alerting it offers is built on top of the data, allowing you to trigger alerts based on the detailed metrics that have been collected. The alerting has a rich set of capabilities: not just ‘Did you go above or below this threshold?’ but ‘Did you exceed the three-hour average by X percent?’ Orchestra can also detect trend shifts, and it supports highly specific conditions, such as ‘Did you fail three times consecutively?’ and ‘Did you fail in X percent of these locations?’ We like to call it a Swiss Army knife because of all the things it can do.
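
As an illustration only (these helpers are hypothetical shorthand, not Orchestra’s actual API), the kinds of alert conditions described above might be expressed like this:

```csharp
// Hypothetical helpers; none of these names come from Orchestra's API.
using System.Collections.Generic;
using System.Linq;

static class AlertRules
{
    // Did the latest value exceed the trailing average by pct percent?
    public static bool AboveTrailingAverage(
        IReadOnlyList<double> window, double latest, double pct) =>
        window.Count > 0 && latest > window.Average() * (1 + pct / 100.0);

    // Did the last n test runs all fail?
    public static bool FailedConsecutively(IReadOnlyList<bool> failures, int n) =>
        failures.Count >= n && failures.TakeLast(n).All(failed => failed);

    // Did at least pct percent of monitored locations fail?
    public static bool FailedInLocations(
        IReadOnlyDictionary<string, bool> failedByLocation, double pct) =>
        failedByLocation.Count > 0 &&
        100.0 * failedByLocation.Values.Count(f => f) / failedByLocation.Count >= pct;
}
```

An alert engine would evaluate conditions like these over sliding windows as results arrive.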

I’ve always been one of those engineers who likes to build things from scratch. In my previous position as chief architect of the research and development group at DoubleClick/Google, I had the opportunity to do that regularly.

One of the R&D ideas I played with at DoubleClick was mapping the relationship between code and database tables, a.k.a. object-relational mapping (ORM). There was a popular system that did this called Hibernate; however, I wanted something more customizable, which would not only work cleanly and easily but could also be optimized to the coder’s needs.

Thus, I wound up creating an entirely new system, which served the DoubleClick R&D team well in terms of function and resources. (Note: Eventually, Microsoft came out with LINQ, and as much as I enjoy building things from scratch, I know when to move to a third party as anyone should – namely when it makes my life a whole lot easier.)

In the DoubleClick R&D team, alongside me were the other three co-founders of Catchpoint: Mehdi Daoudi, Veronica Ellis, and Dritan Suljoti. Frustrated with traditional monitoring systems, we decided to break off and start our own company.

For the first five years, we developed the platform as a self-funded company, supported solely by family, friends, and what we sold. We knew from the outset that we needed to tackle problems differently from our competitors and offer value not by undercutting others, but through offering a more powerful feature set. Given my background and interest in system capacity modeling, developed over ten years of growth on DoubleClick’s backend systems, it made sense to build our own system to serve as the foundation of the Catchpoint platform.

What Orchestra Needed to Be

In 2008, NoSQL databases were a hot subject. The idea of using a database for storage and retrieval of data in a different way from the tabular relations used in relational databases was still very much in its infancy. Despite the concepts going back to the 1960s, NoSQL didn’t get real traction until the late 2000s.

I was close to the guys from MongoDB (in fact I almost ended up working there) and was closely following their movements when they released in 2009. Today, there are a bunch of NoSQL databases out there, most of which are generic. I liked the idea of building something super-flexible from scratch; it was an exciting time to be designing a new system and figuring out how to make it work.

From our experience at DoubleClick, we knew that a relational database would be way too costly, too difficult to maintain, and too slow for our purposes. For Catchpoint, we needed something more flexible and time-based, specifically indexed toward the entities people would be using in our system. We knew we would be storing big data and had to make a set of estimates about how big the data would be.
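
As a rough sketch of what “time-based and indexed toward entities” can mean in practice (this key shape is an assumption for illustration, not Orchestra’s actual layout):

```csharp
// Illustrative only: records cluster by entity, then by time bucket,
// so "recent data for this entity" is a tight, contiguous range scan.
using System;

readonly struct SeriesKey : IComparable<SeriesKey>
{
    public readonly int EntityId;     // e.g. a synthetic test or a RUM page
    public readonly long TimeBucket;  // e.g. minutes since the Unix epoch

    public SeriesKey(int entityId, DateTime utc)
    {
        EntityId = entityId;
        TimeBucket = (long)(utc - DateTime.UnixEpoch).TotalMinutes;
    }

    // Cluster by entity first, then newest bucket first, so a query for
    // an entity's most recent data reads from the front of its range.
    public int CompareTo(SeriesKey other)
    {
        int byEntity = EntityId.CompareTo(other.EntityId);
        return byEntity != 0
            ? byEntity
            : other.TimeBucket.CompareTo(TimeBucket); // descending time
    }
}
```

With this ordering, fetching an entity’s most recent buckets is a contiguous scan, which is exactly the query-by-recency pattern described below.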

We couldn’t guess that years later, Orchestra would be handling some of the industry’s largest monitoring data sets from companies like Google and LinkedIn, but we did know from the start that we needed something scalable.

At DoubleClick, for example, we worked with other third-party processing systems, one of which was Ab Initio. It’s a great system with a nice UI that allows you to process data in a highly elegant way, which is hard not to admire. While working with it, I often thought about how easy they had made it to move data from one stage to another (to sort, filter, transform, etc.). It was truly a beautiful system in many ways. However, very few people knew how to use it fully and there was a tremendous cost to scale it.

The VP of Engineering encouraged us to use it across our backend systems. We were spending seven figures for the system, but arguably needed eight to do it right. There was also a slew of exceptions that didn’t work fluidly. Ultimately, we found there was lots of special functionality that wasn’t documented and/or needed extra support. I also had to build my own testing tools, as few were available at the time. We were constantly jumping through hoops, trying to work out better ways to do things and building new systems to sort, analyze, compare, and review data.

Building on experiences like this, it made sense when launching Catchpoint to build our own system from the outset, and it wound up being one of the best moves we made. Using open source systems can be great, and you can do your own customizations; however, this can involve a lot of work at the outset, and more whenever a new version comes out, which may require re-adapting all over again, depending on the changes.

On the other hand, when you write something yourself from scratch, the power is all yours to optimize it to your business needs independently and interactively with the rest of your system. From the start and throughout, you write what is optimal for your specific system.

We knew that we needed to make serialization as cheap and efficient as possible (we didn’t have the cash to afford tons of machines). We also wanted to be able to merge data as it came in; several years ago, we extended this to work in real time. Back then, we knew we’d need to query the system fast, especially by recency. We wanted it to be flexible so that if changes needed to be made quickly, the system could handle it.
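
As a minimal sketch of the cheap-serialization idea (the record layout here is an assumption for illustration, not Orchestra’s real format), fixed-width binary fields make writing and re-reading samples a matter of copying a few bytes rather than parsing text:

```csharp
// An assumed record layout, not Orchestra's real one: fixed-width
// binary fields keep each sample at 16 bytes with no parsing cost.
using System.IO;

readonly struct Sample
{
    public readonly long TimestampUtcTicks;
    public readonly int EntityId;
    public readonly float ResponseMs;

    public Sample(long ticks, int entityId, float responseMs)
    {
        TimestampUtcTicks = ticks;
        EntityId = entityId;
        ResponseMs = responseMs;
    }

    public void Write(BinaryWriter w)
    {
        w.Write(TimestampUtcTicks); // 8 bytes
        w.Write(EntityId);          // 4 bytes
        w.Write(ResponseMs);        // 4 bytes
    }

    public static Sample Read(BinaryReader r) =>
        new Sample(r.ReadInt64(), r.ReadInt32(), r.ReadSingle());
}
```

Because records in this sketch are fixed-width and timestamp-ordered, merging an incoming batch into stored data can be a streaming pass rather than a parse-and-rebuild.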

In essence, we had to store data efficiently and flexibly, with indexing for synthetic tests and real-user data for clients of the system. Instead of waiting for a new style of customizable database to come out that could do all those things, or using something that was very flexible but clearly non-optimal in terms of storage consumption, merge performance, and query speed, we realized it made sense to go ahead and do it ourselves.

In terms of language, we wanted to use a full programming language, as no scripting language would work as well. Before the DoubleClick R&D team, I had predominantly worked in C++. Once that team was founded, I had the choice to move onto something else to help speed up our development, and although I was initially inclined to go with Java, after investigating C#, and being very comfortable in the Visual Studio suite of software development tools, it made total sense to switch over from C++ to C#. After successfully working with C# for several years, there was little reason to switch to anything else when we started Catchpoint.

Originally, there were two stacks: the frontend and the backend. Veronica Ellis, one of the other co-founders, was focused on the frontend while I was the original backend lead. The backend at the time comprised three areas, namely Orchestra, Core Infrastructure, and Agent (Synthetic). Eventually, the Agent realm split off into its own backend team led by Dylan Greiner, while Orchestra and Core Infrastructure remained together, led by Gavin Lynch. Core Infrastructure essentially constitutes the backbone through which all big data flows, namely the Synthetic and Real User Monitoring (RUM) data.

What Makes Orchestra Unique?

Typically, you can make something that is rich in terms of feature set, but very expensive, or something that doesn’t have as strong a feature set yet is cheap. Orchestra combines the strengths of being both feature rich and inexpensive with the ability to scale.

There are two different sources of data: real user data and synthetic (or proactive) data. Both are necessary data sources when implementing a complete monitoring solution.

Orchestra handles both use cases, making Catchpoint the only monitoring company handling both kinds of data with the same solution. Most other companies do not have the depth and/or variety of our rich feature set.

For instance, Catchpoint is the only one with an outage predictor and estimator that is multivariable. Likewise, no one else has Trend Shift or Spotlight. Yes, some of these exist in companies with server-side infrastructure monitoring, but the big difference is that they can’t properly filter real-life fluctuations the way that Orchestra can. This, in addition to real-time data, means Catchpoint alerts have far fewer false positives.

Finally, with other companies you usually have to implement a third-party tool to slice and dice the data. With Catchpoint, Orchestra comes embedded with the tool to perform such analysis.

Learning Orchestra

When a new engineer starts at Catchpoint, there is a steep learning curve as they learn a new system. It usually takes several months for people to come to grips with Orchestra. It can even be hard to describe what we do in job postings, since you can’t just put “MongoDB expert.” However, this typically means we attract self-motivated employees who want to learn and innovate. Instead of just looking for people with a certain kind of expertise, we search out great programmers who know data structures and algorithms, but who are equally open to newness and ultimately enjoy what they’re doing more than the tool they’re doing it with.

Updates to Orchestra

We have been updating Orchestra from the beginning. Three years ago, this picked up steam when we hired the aforementioned Gavin Lynch, who has since spearheaded an array of new updates to Orchestra, proving that it continues to be adaptable. Some of these updates include:

  • Parallelizing query processing: Instead of executing a query from start to finish on the same thread, parallelizing divides up the work to execute on multiple threads. As our UI pages have become more complicated, they are typically served by 5-6 bundled queries. Making work more efficient through parallelization has been especially helpful when querying data across a whole client’s division (see the sketch after this list).
  • Parallelizing data merging: Historically, data merge speed was considered the main metric of capacity. Today, we are able to merge data in Orchestra 3-4x faster than before, and the greater capacity means we get a great deal more out of the hardware.
  • Overriding .NET: Instead of relying purely on .NET to do the magic of handling memory allocation, garbage collection, and IO, we found areas of the code that could be made more memory- and IO-efficient by writing our own .NET replacements.
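
As a hedged sketch of the parallelized-query idea referenced above (RunQueryAsync is a hypothetical stand-in for Orchestra’s real query execution), a page’s bundled queries can be dispatched concurrently so the page waits roughly for the slowest query rather than the sum of all of them:

```csharp
// Illustrative only: run a page's bundled queries concurrently
// instead of serially. RunQueryAsync is a hypothetical stand-in.
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelQueries
{
    static async Task<double[]> RunQueryAsync(string query)
    {
        await Task.Delay(100);        // simulate storage/IO latency
        return new[] { 1.0, 2.0, 3.0 };
    }

    static async Task Main()
    {
        string[] bundle = { "q1", "q2", "q3", "q4", "q5", "q6" };

        // All six queries in flight at once: total wall time is roughly
        // the slowest single query, not the sum of all six.
        double[][] results =
            await Task.WhenAll(bundle.Select(RunQueryAsync));

        Console.WriteLine($"got {results.Length} result sets");
    }
}
```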

Additional updates performed at an operational level include:

  • APIs: Having APIs means that instead of needing to query the system manually, or relying solely upon our automatically generated reports, those short-end-of-the-tail clients who have a heavy amount of usage and the time to do interesting things with their data can hit it as often as they like. They’re able to reap the rewards of doing so without impacting the rest of the users interacting with Orchestra manually.

Some of the updates have come from our twice-yearly Hackathons (see Our First Hackathon Projects Have Graduated to Production! and Why you should work at Catchpoint). These intensive week-long R&D sessions have led to a variety of new features for Orchestra, including:

  • Parallelized IO reads: An extension of parallelized queries that allows multiple entity queries at the IO level
  • Overriding .NET libraries: One particular update replaced .NET’s IPAddress with a structure that reduces memory consumption by about an order of magnitude (see the sketch after this list).
  • Orchestra Testing Framework: A highly flexible framework for black-box testing in Orchestra
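
For a sense of where the IPAddress savings mentioned above come from: System.Net.IPAddress is a heap-allocated class with per-object overhead, whereas an IPv4 address fits in four bytes. The struct below is an assumption about the shape such a replacement might take, not Catchpoint’s actual code:

```csharp
// An assumed shape for such a replacement, not Catchpoint's actual code:
// a 4-byte value type instead of the heap-allocated IPAddress class.
using System;

readonly struct PackedIPv4 : IEquatable<PackedIPv4>
{
    private readonly uint _bits; // the whole address, no heap allocation

    public PackedIPv4(byte a, byte b, byte c, byte d) =>
        _bits = ((uint)a << 24) | ((uint)b << 16) | ((uint)c << 8) | d;

    public bool Equals(PackedIPv4 other) => _bits == other._bits;
    public override bool Equals(object obj) => obj is PackedIPv4 p && Equals(p);
    public override int GetHashCode() => (int)_bits;

    public override string ToString() =>
        $"{_bits >> 24}.{(_bits >> 16) & 0xFF}.{(_bits >> 8) & 0xFF}.{_bits & 0xFF}";
}
```

Stored by the millions, a 4-byte struct avoids both the object header and the reference to it, which is where an order-of-magnitude reduction can come from.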

What’s Next?

The big issues in relation to scale and capacity have been overwhelmingly solved, and our continued updates have made Orchestra several orders of magnitude better than when we originally started. It’s hugely satisfying to see thousands of clients using and deriving value from it.

However, there are always more fun puzzles to play with and new tricks to be explored to make the system faster and better. There are a few areas of code debt that need to be addressed; we will allocate sprint time to work out how they might be solved.

Another current limitation is connected to how we divide data (i.e. partition or shard). At the moment, you can’t cross multiple Orchestra partitions when you want to query. We are going to attack that problem shortly.

There are also some further memory issues we’re looking to solve. The amount of memory that a single instance takes up is huge and .NET (like all languages) has some challenges with allocating that much memory; we’re looking into it more to see if and how it can be solved.

We are also continually playing with the dashboard and the analysis module, which acts as a slice-and-dice interface for interacting with data however the user sees fit. When data comes in from multiple locations and is blended together, it might not always be possible to see the problem; this is Simpson’s Paradox. We are adding a new feature to the dashboard, building on Correlation Metrics, which tells you how different metrics correlate with any change and ranks them by how strongly they relate to the problem. If one of your pages is too slow, this feature will help you work out why it got too slow (was it the DNS, the connection time, etc.?), so you can swiftly find a solution.
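
Here is a worked example of Simpson’s Paradox in a monitoring setting (all numbers invented for illustration): every location’s failure rate improves, yet the blended rate worsens because traffic shifts toward the weaker location:

```csharp
// Invented numbers: each location improves, the blend gets worse.
using System;

class SimpsonsParadox
{
    static void Main()
    {
        // (failures, tests) per location, before and after a change
        var before = new[] { (fail: 50,  total: 1000),   // location A: 5%
                             (fail: 20,  total: 100) };  // location B: 20%
        var after  = new[] { (fail: 4,   total: 100),    // A: 4%  (better)
                             (fail: 180, total: 1000) }; // B: 18% (better)

        double Rate((int fail, int total)[] xs)
        {
            int f = 0, t = 0;
            foreach (var x in xs) { f += x.fail; t += x.total; }
            return 100.0 * f / t;
        }

        Console.WriteLine($"blended before: {Rate(before):F1}%"); // 6.4%
        Console.WriteLine($"blended after:  {Rate(after):F1}%");  // 16.7%
    }
}
```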

At Catchpoint, we focus on monitoring everything, everywhere. Another area we’re working on is how to point our clients to the right places in the system with even more transparency. Our goal is to straightforwardly point out any problems we see and offer simple, clear guidance on how to fix them.

The more you can shepherd people to the right places, the more helpful it is. We want to build on those things we’ve already done in this regard, such as trend shifting alerts, Spotlight and SmartBoard. Finally, there is a team on the frontend, adding further business logic to help here as well.

Ultimately, we have plans to convert Orchestra into a plug-and-play database that retains all the customizable optimization benefits that make it the difference maker it is, while allowing different kinds of time-based data to be stored and queried for any kind of big-data business need.

We’ve spent the last ten years constantly improving and developing Orchestra, making it a better and better mousetrap. In Field of Dreams, Kevin Costner’s character hears a voice exhorting him, “If you build it, he will come.” For me, it’s been more the case of, “If you build it, good things will come.” Ultimately, the decision to self-build Orchestra has paid huge dividends and is at the heart of maintaining Catchpoint’s performance.

We’re hiring! To learn more about the engineering team at Catchpoint, head over to our careers page.
