Datadog on Incident Management

August 27, 2020

Ara Pulido

Léo Cavaillé

Matt Hardwick

Category

reliability →

Datadog is a monitoring and analytics platform that ingests trillions of data points per day from more than 8,000 customers. With a complex distributed architecture and hundreds of deployments per day, things sometimes don't go as planned. Over the years, our teams have improved the way incidents are managed at Datadog, and they are using that knowledge to help Datadog customers manage their own incidents.

In this session, Technical Evangelist Ara Pulido will chat with Léo Cavaillé, SRE Manager, and Matt Hardwick, an engineer working on Datadog’s incident application. They will discuss how incident management evolved at Datadog, how we handle incidents today, and how the SRE team is working alongside the engineers building Datadog’s incident application to make Datadog the best place to organize, investigate, manage, and resolve your infrastructure and application incidents.

By the end of the session you will have a better understanding of how incidents are managed at Datadog and how those practices can help you manage incidents in your own organization.

Episodes like this

Datadog on Datadog →

Datadog on Stateful Workloads on Kubernetes →

Datadog on Data Science →

Datadog on Kubernetes Autoscaling →

Datadog on Kubernetes Node Management →

Datadog on Caching →

Datadog on Site Reliability Engineering →

Datadog on Building an Event Storage System →

Datadog on Gamedays →

Datadog on Chaos Engineering →

Datadog on Kubernetes Monitoring →

Datadog on Kubernetes →