Datadog on Site Reliability Engineering

February 22, 2023

Brandon West

Laura de Vesine

Rick Mangi

There are many ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritization flows, there’s no golden path for how to organize things. As Datadog has grown from a startup into a fast-growing public company, we’ve seen our own SRE practice evolve. With over 22,000 customers sending trillions of data points each day, keeping Datadog reliable is critical to our business.

In this episode of Datadog on, join Staff Engineers Laura de Vesine and Rick Mangi to hear how Datadog’s approach to SRE has changed with scale and experience. Their unique backgrounds and roles – Rick is embedded on a team building an internal platform, while Laura works across multiple teams on a variety of projects – will highlight some of the different methodologies and how we use them.

You’ll learn how Datadog approaches technical debt and legacy systems, some key differences between SRE for startups versus larger companies, how to get buy-in for SRE practices at an organizational level, and more. Then you’ll have an opportunity to ask questions during live Q&A.
