Datadog on Kubernetes Node Management
October 10, 2023
Adrien Trouillaud, David Benque and Ara Pulido
Datadog, the observability platform used by thousands of companies, runs on dozens of self-managed Kubernetes clusters in a multi-cloud environment, adding up to tens of thousands of nodes, or hundreds of thousands of pods. This infrastructure is used by a wide variety of engineering teams at Datadog, with different feature and capacit...
Datadog on Maintaining eBPF at Scale
September 27, 2023
Valeri Pliskin, Guy Arbitman and Andrew Krug
The extended Berkeley Packet Filter (eBPF) has resulted in an ecosystem of new tooling that allows running programs in the Linux kernel without loading kernel modules. eBPF seeks to do this safely within a secure sandbox environment and has been a boon to observability and security.
Datadog on Mobile Software Development
August 22, 2023
Xavier Gouchet, Maciek Grzybowski and Ara Pulido
Understanding the health and user experience of your mobile application is critical in order to avoid user frustration, understand application crashes, and reduce the mean time to resolution for bugs. To help with that task, Datadog has a mobile monitoring solution that allows developers to better understand and improve their application. But ...
Datadog on WebRTC
May 31, 2023
Brandon West, Jason Thomas and Brad Carter
WebRTC is a standard for real-time digital communications that enables video, audio, and data streaming. Although WebRTC was originally created for web browsers, Datadog uses it to create streaming applications across a variety of platforms, from Electron-based native applications to mobile applications.
Datadog on Caching
April 27, 2023
Jessica Cordonnier, Mitch Ward and Ara Pulido
Caching (and cache invalidation!) is often mentioned as one of the hardest problems in computer science. While caching can bring substantial performance improvements, reasoning about cached data can be extremely difficult as caching fundamentally means that you are no longer reading from your source of truth. With that in mind, many te...
Datadog on Data Engineering Pipelines: Apache Spark at Scale
March 23, 2023
Alodie Boissonnet, Anton Ippolitov and Ara Pulido
Datadog is an observability and security platform that ingests and processes tens of trillions of data points per day, coming from more than 22,000 customers. Processing that amount of data in a reasonable time stretches the limits of well-known data engines like Apache Spark.
Datadog on Site Reliability Engineering
February 22, 2023
Brandon West, Laura de Vesine and Rick Mangi
There are many different ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritization flows, there’s no golden path for how to organize things. As Datadog has shifted from a startup to a quickly-growing public company, we’ve seen our own SRE practice evolve. Wit...
Datadog on the Lifecycle of Threats and Vulnerabilities
January 12, 2023
Nick Frichette, Adam Stevko and Andrew Krug
The security industry is full of complex terminology like threat, vulnerability, and mitigations. Definitions matter as we design processes that scale. At Datadog, the Security Research functions are focused on detection and response to specific types of threats and vulnerabilities. Workload vulnerabilities, cloud control plane vuln...
Datadog on Building an Event Storage System
December 13, 2022
Ara Pulido, Guillaume Duranceau and Ryan Worl
When Datadog introduced its Log Management product, it required a new event data storage platform, as storing logs and events is a completely different problem from storing metrics, which was the first Datadog product.
Datadog on gRPC
September 29, 2022
Ara Pulido, Anthonin Bonnefoy and Antoine Tollenaere
Datadog, the observability platform used by thousands of companies, is made up of hundreds of services that communicate over the network using gRPC, an RPC framework, making it a critical component for Datadog’s reliability.
Datadog on Data Informed Product Development
July 26, 2022
Ara Pulido, Miranda Kapin and Derek Howles
Datadog is an observability and security platform. That means that our users may be in a high stress situation: debugging an issue in production, managing an incident or responding to a security threat. Having a good UX is particularly critical in those cases.
Datadog on Detecting Threats using Network Traffic Flows
June 11, 2022
Theo Guidoux, Andrew Krug and Anna Pauxberger
At Datadog’s scale, with over 18,000 customers sending trillions of data points per day, analyzing the volume of data coming in can be challenging. Networking logs are one of the largest internal log sources at Datadog. Being able to analyze and make sense of them is critical to keeping Datadog secure. To help with the task, we have bui...
Datadog on Web Security Standards
May 19, 2022
Jean-Baptiste Aviat, Ayaz Badouraly and Andrew Krug
Datadog on Rust
February 23, 2022
Duarte Nunes, Ara Pulido and Brian Troutwine
Rust is a programming language that has been gaining popularity over the past few years, with its adopters claiming that it helps them write faster, more memory-efficient, and more reliable software.
Datadog on Profiling in Production
January 28, 2022
Julien Danjou and Kirk Kaiser
Depending on your chosen programming language and stack, you may have never used a profiler in production. The very idea of using a profiler in production for a web service may seem unrealistic, due to the amount of overhead involved. After all, aren’t profilers extremely computationally expensive to run?
Datadog on Data Visualization
December 14, 2021
Mark Hintz, Ara Pulido and Kemper Smith
Datadog customers send trillions of data points per day. These data points are processed by Datadog and used to debug production issues in real time. But, in order to reason about all this data, we humans need visual representations. Visualizations can help us discover connections and problem points.
Datadog on Building Responsive UX
September 30, 2021
Amy Luo, Edwin Morris and Ara Pulido
Datadog product designers and frontend developers have been working together to create a new, better UX for creating dashboards, which is one of the most important parts of using Datadog. A central part of this effort was building a new layout engine. Working on this project was a bit different from the usual feature work, so the colla...
Datadog on Gamedays
August 31, 2021
Elijah Andrews, Mike Petruzelli and Ara Pulido
As engineers, as we scale our applications and infrastructure, we accept that failure can and will happen. But, how can we get ahead of those potential failures? Gamedays are events which aim to test the resilience of a system when facing abnormal and turbulent situations, checking whether our expectations on how it will fail (or not) ...
Datadog on Chaos Engineering
June 1, 2021
Joris Bonnefoy, Tay Nishimura and Ara Pulido
As you scale your applications, remaining resilient to underlying network failures, resource constraints introduced by other applications, or spikes in traffic can become exponentially more complex, even with very thorough testing and processes. Chaos engineering is a discipline that encourages experimenting in production and injecting...
Datadog on Security and Compliance
March 31, 2021
Kirk Kaiser and Andrew Spangler
At Datadog, customer trust and data security are of the utmost importance.
Datadog on Agent Integration Development
March 23, 2021
Christine Chen, Ara Pulido and Julia Simon
To make sure that customers are getting the most out of the platform in the least amount of time, Datadog maintains more than 400 built-in integrations. These integrations collect metrics, events, and logs from a diverse set of sources: databases, source control, bug tracking tools, cloud providers, automation tools, and more.
Datadog on eBPF
January 26, 2021
Lee Avital, Guillaume Fournier and Ara Pulido
eBPF (extended Berkeley Packet Filter) is a Linux technology that can run sandboxed programs in the kernel without changing kernel source code or loading kernel modules. While the kernel is an ideal place to implement monitoring/observability, networking, and security, it wasn't until the recent broad adoption of eBPF that it was feasib...
Datadog on Serverless
December 10, 2020
David Huie, Kirk Kaiser and Andrew Krug
The Datadog Security Platform team leverages Serverless to ingest security events across many different cloud providers, deployment platforms, and devices. These security events are then transformed and shipped to a data lake to help defend and protect the platform as a whole. Once there, these ingested events are used to drive interna...
Datadog on Kubernetes Monitoring
November 16, 2020
Celene Chang, Charly Fontaine and Ara Pulido
With many blog posts published and talks given on the topic, it’s no secret that Datadog is running Kubernetes at scale. We currently run dozens of clusters, some of them with thousands of nodes. Additionally, we have clusters running in multiple clouds. How are we monitoring all of that, ensuring we can scale up quickly and safely?
Datadog on Software Delivery
September 30, 2020
Jacob LeGrone, Ara Pulido and Benjamin Smith
Over 800 engineers at Datadog perform thousands of deployments per day to hundreds of services in different environments, regions, and cloud providers. How can we manage all those deployments in a common way and keep a reliable paper trail to audit any changes?
Datadog on Incident Management
August 27, 2020
Leo Cavaille, Matt Hardwick and Ara Pulido
Datadog is a monitoring and analytics platform that ingests trillions of data points per day, coming from more than 8,000 customers. With a complex distributed architecture and hundreds of deployments per day, needless to say, sometimes things don't go as planned. Our teams have been improving the way incidents are managed at Datadog ov...
Datadog on RocksDB
June 30, 2020
James Bibby, Kenny House and Ara Pulido
Datadog is a monitoring and analytics platform that ingests trillions of data points per day, coming from more than 8,000 customers. Each of those is associated with metadata, mostly in the form of tags, and it can also be part of streams of related data points, which can then be explored, queried, or aggregated. RocksDB is used by man...
Datadog on Kafka
May 27, 2020
Jamie Alquiza, Kirk Kaiser and Balthazar Rouberol
In this session, we’ll speak with two engineers responsible for scaling the Kafka infrastructure within Datadog, Balthazar Rouberol and Jamie Alquiza. They'll share their strategy for scaling Kafka, how it’s been deployed on Kubernetes, and introduce kafka-kit, our open source toolkit for scaling Kafka clusters.
Datadog on Kubernetes
May 27, 2020
Laurent Bernaille and Ara Pulido
When Datadog decided two years ago to move its infrastructure platform to Kubernetes, we didn’t expect to find so many roadblocks, but ingesting trillions of data points per day in a reliable fashion requires pushing the limits of cloud computing.