Datadog on Caching

April 27, 2023

Ara Pulido

Jessica Cordonnier

Mitch Ward

Category

distributed systems →

backend →

reliability →

performance →

Caching (and cache invalidation!) is often mentioned as one of the hardest problems in computer science. While caching can bring substantial performance improvements, reasoning about cached data can be extremely difficult, because caching fundamentally means that you are no longer reading from your source of truth. With that in mind, many teams at Datadog needed to build distributed caches to scale their services and keep latency low.

As Datadog grew in size and complexity, having each team design and operate its own cache solution became a bottleneck and added to that complexity. Based on that experience, a team was created to design, build, and maintain a managed service for distributed in-memory caching, providing an easy way for over 2,000 engineers at Datadog to add fast caching to their systems in a scalable, reliable, and consistent manner.

In this session, Ara Pulido, Staff Developer Advocate, will chat with Mitch Ward and Jessica Cordonnier, engineering managers on the Caching team at Datadog. They will explain how they applied lessons from prior cache implementations and distributed systems principles to design the caching platform at Datadog. They will cover the various components that make up the platform, including the storage system, data structures, and scaling solutions.

By the end of the session, you will better understand caching systems, their potential pitfalls and how to mitigate them, and how to run a cache infrastructure as an internal platform as a service. Unfortunately, we can't offer any help naming your internal caching platform; that's another difficult computer science problem for another time!

Episodes like this

Datadog on Datadog →

Datadog on Building Reliable Distributed Applications Using Temporal →

Datadog on OpenTelemetry →

Datadog on Secure Remote Updates →

Datadog on LLMs: From Chatbots to Autonomous Agents →

Datadog on Stateful Workloads on Kubernetes →

Datadog on Data Science →

Datadog on Kubernetes Autoscaling →

Datadog on Kubernetes Node Management →

Datadog On Maintaining eBPF at Scale →

Datadog on Data Engineering Pipelines: Apache Spark at Scale →

Datadog on Site Reliability Engineering →

Datadog on Building an Event Storage System →

Datadog on gRPC →

Datadog on Rust →

Datadog on Profiling in Production →

Datadog on Gamedays →

Datadog on Chaos Engineering →

Datadog on Agent Integration Development →

Datadog on eBPF →

Datadog on Serverless →

Datadog on Kubernetes Monitoring →

Datadog on Software Delivery →

Datadog on Incident Management →

Datadog on Kubernetes →