October 10, 2023
Datadog, the observability platform used by thousands of companies, runs on dozens of self-managed Kubernetes clusters in a multi-cloud environment, adding up to tens of thousands of nodes and hundreds of thousands of pods. This infrastructure serves a wide variety of engineering teams at Datadog, each with different feature and capacity needs.
How do we make sure that tens of thousands of nodes, with very different specifications and spread across different clouds, stay healthy, patched with the latest security updates, and running up-to-date versions of the kubelet and container runtime, all without breaking applications or interrupting the more than a thousand engineers who rely on this infrastructure for their daily work?
In this session, Ara Pulido, Staff Developer Advocate, will chat with Adrien Trouillaud, Engineering Manager, and David Benque, Staff Software Engineer, both part of the Compute team, about their strategies, lessons learned, and practical tips for successfully managing a huge fleet of Kubernetes nodes.
By the end of the session, you will have a set of practical tips on how to prepare for scaling your Kubernetes clusters to hundreds or even tens of thousands of nodes.
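To make the problem concrete, replacing or updating a node typically starts by cordoning it so no new pods are scheduled there, then draining its workloads gracefully. The sketch below is only an illustration of that primitive, not Datadog's tooling: it assumes the official `kubernetes` Python client, a kubeconfig with cluster access, and a hypothetical node name.

```python
# Minimal sketch: cordon a node ahead of a rolling replacement, then list the
# pods that would still need to be drained. Assumes the official `kubernetes`
# Python client and valid cluster credentials; the node name is hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

NODE_NAME = "example-node-1"  # hypothetical node name

# Cordon: mark the node unschedulable so no new pods land on it.
v1.patch_node(NODE_NAME, {"spec": {"unschedulable": True}})

# List the pods still running on the node; a real drain would evict them
# through the Eviction API so that PodDisruptionBudgets are respected.
pods = v1.list_pod_for_all_namespaces(
    field_selector=f"spec.nodeName={NODE_NAME}"
)
for pod in pods.items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name}")
```

At fleet scale this kind of rotation is automated rather than run by hand, but the same primitives (cordon, drain, PodDisruptionBudgets) are what keep application disruption bounded during node updates.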