Datadog on Caching

April 27, 2023

Ara Pulido

Jessica Cordonnier

Mitch Ward

Caching (and cache invalidation!) is often cited as one of the hardest problems in computer science. While caching can bring substantial performance improvements, reasoning about cached data can be extremely difficult, because caching fundamentally means that you are no longer reading from your source of truth. Despite that difficulty, many teams at Datadog needed to build distributed caches to scale their services and keep latency low.

As Datadog grew in size and complexity, having each team design and operate its own cache solution became a bottleneck and added further complexity. Based on that experience, a team was created to design, build, and maintain a managed service for distributed in-memory caching, giving over 2,000 engineers at Datadog an easy way to add fast caching to their systems in a scalable, reliable, and consistent manner.

In this session, Ara Pulido, Staff Developer Advocate, will chat with Mitch Ward and Jessica Cordonnier, engineering managers on the Caching team at Datadog. They will explain how they applied lessons from prior cache implementations, along with distributed systems principles, to design the caching platform at Datadog. They will cover the various components that make up the platform, including the storage system, data structures, and scaling solutions.

By the end of the session, you will have a better understanding of caching systems, their potential pitfalls and how to mitigate them, and how to run cache infrastructure as an internal platform as a service. Unfortunately, we can't offer any help naming your internal caching platform; that's another difficult computer science problem for another time!