August 31, 2021
As engineers scaling our applications and infrastructure, we accept that failure can and will happen. But how can we get ahead of those potential failures? Gamedays are events that test the resilience of a system under abnormal and turbulent conditions, checking whether our expectations about how it will fail (or not) are correct.
In this session, Ara Pulido, Technical Evangelist, will chat with Mike Petruzelli, reliability engineer on the Core Resilience team, and Elijah Andrews, software engineer on the Traffic team. We’ll discuss and show examples of how gamedays are organized at Datadog, particularly how reliability engineers partner with teams across the organization to run larger events focused on general failures that affect a large part of the system.
By the end of the session, you will have a better understanding of what gamedays are and how you can start organizing them at your company.
Datadog on Building Reliable Distributed Applications Using Temporal →
Datadog on OpenTelemetry →
Datadog on Secure Remote Updates →
Datadog on Stateful Workloads on Kubernetes →
Datadog on Data Science →
Datadog on Kubernetes Autoscaling →
Datadog on Kubernetes Node Management →
Datadog on Caching →
Datadog on Data Engineering Pipelines: Apache Spark at Scale →
Datadog on Site Reliability Engineering →
Datadog on Building an Event Storage System →
Datadog on gRPC →
Datadog on Chaos Engineering →
Datadog on Serverless →
Datadog on Kubernetes Monitoring →
Datadog on Software Delivery →
Datadog on Incident Management →
Datadog on Kubernetes →