February 22, 2023
There are many ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritization flows, there’s no single golden path for organizing an SRE practice. As Datadog has shifted from a startup to a fast-growing public company, we’ve seen our own SRE practice evolve. With over 22,000 customers sending trillions of data points each day, keeping Datadog reliable is critical to our business.
In this episode of Datadog on, join Staff Engineers Laura de Vesine and Rick Mangi to hear how Datadog’s approach to SRE has changed with scale and experience. Their unique backgrounds and roles – Rick is embedded on a team building an internal platform, while Laura works across multiple teams on a variety of projects – will highlight some of the different methodologies and how we use them.
You’ll learn how Datadog approaches technical debt and legacy systems, some key differences between SRE for startups versus larger companies, how to get buy-in for SRE practices at an organizational level, and more. Then you’ll have an opportunity to ask questions during a live Q&A.
Datadog on Building Reliable Distributed Applications Using Temporal →
Datadog on OpenTelemetry →
Datadog on Secure Remote Updates →
Datadog on Stateful Workloads on Kubernetes →
Datadog on Data Science →
Datadog on Kubernetes Autoscaling →
Datadog on Kubernetes Node Management →
Datadog on Caching →
Datadog on Data Engineering Pipelines: Apache Spark at Scale →
Datadog on Building an Event Storage System →
Datadog on gRPC →
Datadog on Gamedays →
Datadog on Chaos Engineering →
Datadog on Serverless →
Datadog on Kubernetes Monitoring →
Datadog on Software Delivery →
Datadog on Incident Management →
Datadog on Kubernetes →