As engineers scaling our applications and infrastructure, we accept that failure can and will happen. But how can we get ahead of those potential failures? Gamedays are events that aim to test the resilience of a system under abnormal and turbulent conditions, checking whether our expectations about how it will fail (or not) are correct.
In this session, Ara Pulido, Technical Evangelist, will chat with Mike Petruzelli, reliability engineer on the Core Resilience team, and Elijah Andrews, software engineer on the Traffic team. We’ll discuss and show examples of how gamedays are organized at Datadog, particularly how reliability engineers partner with teams across the organization to run larger events focused on general failures that affect a large part of the system.
By the end of the session, you will have a better understanding of what gamedays are and how you can start organizing them at your company.