Datadog

Table of Contents

Introduction
Contract
Organizations
Access
Permissions
Mobile App
Usage Management
Cost Management
Self-Hosted Alternatives
Terraform
Integrations
Infrastructure Hosts
Logs
Metrics
Monitors & Synthetics
Real User Monitoring (RUM)
Service Level Objectives (SLOs)
Oncall
Incident Management
Bits AI & Bits AI SRE
MCP Server
CI Visibility
Observability Pipelines
Workflows

Introduction

I started using Datadog at Peloton in 2019. Prior to that, I had used a hodgepodge of observability tools including AWS CloudWatch, Loggly, and a bunch of internal Amazon tools (e.g. Timber). I continued to use Datadog at Better and Humane, and owned the architecture for its deployment at GM. Since first adopting Datadog, I have seen a dozen other tools - Observe, Coralogix, Chronosphere, Mezmo (LogDNA at the time), Grafana Labs, SigNoz, Splunk, Sentry, and a few homegrown ELK stacks. I cannot claim I have seen every tool, or even every popular tool, but I've gotten very good breadth and a fair bit of depth in my career so far. I also cannot claim to have used every feature of Datadog, but I've definitely used the top 20 at multiple jobs. As another data point, I passed two of the three available Datadog certifications at DASH in under an hour, combined.

Datadog is a cloud-hosted unified observability platform that offers tooling and integrations to collect signals from just about any source and index them into either a time series database or a search-and-analytics store. Datadog then provides tooling to visualize that data, extract automated or manual insights from it, and monitor and alert on it. Like many players in this space, Datadog cannot be self-hosted; the closest self-hostable alternatives would be SigNoz or Grafana.

The obvious question is "Should I use Datadog?" and I tend to answer it the same way I would if you asked "Should I use AWS?" In both cases, the answer is "You probably should, but it is important to evaluate your specific needs." Like any PaaS or IaaS provider, Datadog works well for a large number of customers, but not every customer. I would estimate 85%-90% of companies could cost-effectively and efficiently integrate Datadog and be very pleased long-term. We will explore when you should do extra diligence in this guide.

If you get into the business of 3rd party observability, it won't take long for someone to tell you how expensive Datadog is relative to other solutions in the space. I want to address this right up front so there's no confusion and expectations are set properly: Datadog CAN be expensive. This is due to a few different factors, which I will talk about, and not the result of any nefarious motive on Datadog's part. Much of this guide will keep costs in mind and share suggestions for managing cost from day one, to avoid the kind of unwelcome surprises that leave a salty taste for the entirety of your relationship. With that in mind, let's get going!

Note: I will pitch this guide somewhere between "you have never done this" and "you have a decent idea what you're doing," with a few "expert-level" deep thoughts sprinkled in here and there. Note: I will mark each section that contains any AI-generated content, but expect that over time, there will be little to no AI-generated content.

Contract

You don't need a contract to use Datadog; you can just sign up on their website. They give you a 14-day trial to test things out, and you can pop in a credit card for any overages and incidentals. But you will pay significantly more with this model. If you have any kind of significant volume, it is wiser to sign a contract. The cost difference from the public website pricing varies by product; for example, 15-day log retention goes from $2.55 to $1.70 (a 33% discount) when you sign a contract. What does significant mean? If you would spend over $30,000 a month on Datadog without a contract, I think it's worth discussing one.

I've seen half a dozen Datadog contracts, which isn't comprehensive but is a large enough sample to provide some useful information. I will dive into more detail about these elsewhere if some of the terms do not make sense.

Organizations

Datadog organizations allow you to separate users and data within your overall account. This is particularly useful for larger or divided companies that need to maintain separation between areas.

Generally, I recommend at least two organizations in an account: one as the payer organization and at least one child organization. This is the same model you would see in AWS, where a payer/root account manages the organization and many child accounts.
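If you manage this as code, the DataDog/datadog Terraform provider can create child organizations from the payer organization. A minimal sketch, assuming the provider is authenticated with the parent organization's keys (the organization name here is hypothetical):

```hcl
# Creates a child organization under the payer org; the provider must be
# configured with the parent organization's API and application keys.
resource "datadog_child_organization" "product" {
  name = "acme-product" # hypothetical; organization names are capped at 32 characters
}
```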

You may be tempted to split your production and non-production environments into separate organizations. I do not recommend this pattern at all: it makes it much more complicated to evaluate signals across environments, and your teams may occasionally call production endpoints from non-production services, which becomes harder to trace across that pathway. You can keep a separate organization for testing (especially infrastructure-as-code testing), for experiments, or for truly separate business units that do not intertwine at all.

Datadog runs many copies of itself across different clouds and regions, and each organization can be homed to the cloud and region you choose. If you are a single-cloud company (most are), pick the Datadog site closest to your actual running cloud infrastructure. If you are primarily an on-prem company, just choose the region closest to you. Another benefit of the payer-and-children model is that you can home each child organization to a different cloud or region as appropriate (keeping in mind that their data remains disconnected).

Access

Datadog provides four mechanisms for access to the platform:

There is one additional point here: cross-organization data sharing. It is possible to grant one organization access to some of the observability data collected by a different organization. This worked reasonably well when we tested it, but ultimately we did not utilize this functionality.

Permissions

There are four different permission models in Datadog:

There are four different permission targets in Datadog for Restriction Policies.

Datadog provides three built-in roles (Read, Standard, Admin) but allows you to create your own. I highly recommend building your own roles and never using Read or Standard; your Admin users can just use Admin. A key reason is that when Datadog rolls out new features or permission structures, it will automatically modify the Read and Standard roles without you knowing. By making your own roles, you explicitly manage when new permissions are given to your users. A few other notes:
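To make the custom-role recommendation concrete, here is a hedged Terraform sketch. The role name and the two permission keys are illustrative; the datadog_permissions data source exposes the full map of permission names to IDs:

```hcl
# Look up all available permissions (a map of permission name -> ID).
data "datadog_permissions" "all" {}

# A custom role whose permission set only changes when we change it,
# unlike the built-in Read/Standard roles that Datadog updates silently.
resource "datadog_role" "engineer" {
  name = "Engineer" # hypothetical role name

  permission {
    id = data.datadog_permissions.all.permissions["monitors_write"]
  }

  permission {
    id = data.datadog_permissions.all.permissions["dashboards_write"]
  }
}
```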

Mobile App

Datadog has an app in the App Store (iOS) and on the Google Play Store (Android). I highly recommend you download them for many reasons.

There is a QR-code-based login method that works extremely well and saves you a lot of the headache of managing the login flows, especially for SSO. If you are just using Datadog auth, that works very well natively in the app, of course.

Usage Management (AI)

Datadog provides several tools to help you monitor and manage your usage of its various products. In the Cost & Usage section of Datadog, you can view detailed breakdowns of your consumption for logs, metrics, monitors, and other services. Datadog also ships a few out-of-the-box dashboards that help here:

But it's fair to say that you may want to customize these (you can), or unify them into a single pane of glass. You will also want to graph many of the metrics in the datadog.estimated_usage namespace (all of the dashboards above are built on them), as they make sudden spikes easy to spot. That said, dashboards are not enough.

Setting up usage alerts is an absolute necessity to avoid unexpected charges. Using the metrics above with the Monitors product, you can be alerted to anomalous usage, or to gradual usage growth that breaches predefined thresholds. You can set up cost alerts here too, which will be discussed more in the next section.
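As a hedged sketch of such a usage alert in Terraform: the metric comes from the datadog.estimated_usage namespace mentioned above, while the threshold and notification handle are placeholders you would tune to your own volume:

```hcl
# Alert when estimated log ingestion over the past day crosses a threshold.
# The query metric is real; the threshold (~500 GB/day) and the Slack
# handle are arbitrary placeholders.
resource "datadog_monitor" "log_ingest_spike" {
  name    = "Estimated log ingestion is spiking"
  type    = "metric alert"
  message = "Log ingestion is well above expected daily volume. @slack-observability"
  query   = "sum(last_1d):sum:datadog.estimated_usage.logs.ingested_bytes{*}.as_count() > 500000000000"

  monitor_thresholds {
    critical = 500000000000
  }
}
```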

Cost Management (AI)

Datadog offers several tools to help you understand and manage your costs effectively. The Usage section provides granular breakdowns of your consumption across different products, allowing you to see exactly where your costs are coming from.

For a more comprehensive view, the Cloud Cost Management tool integrates with your cloud provider billing data to show you not just Datadog costs, but also the underlying infrastructure costs associated with your monitored resources. This holistic view helps you understand the total cost of ownership for your observability and monitoring setup.

To optimize costs, consider implementing cost allocation tags to track spending by team or project. Set up budget alerts in both the Usage section and Cloud Cost Management to receive notifications when your spending approaches predefined thresholds. Regularly review your monitor usage, log retention policies, and custom metric creation to identify areas where you can reduce costs without impacting observability.

Self-Hosted Alternatives

Self-hosted alternatives coming soon!

Terraform (AI)

Datadog provides a Terraform provider that allows you to manage your Datadog configuration as code. This includes monitors, dashboards, service level objectives (SLOs), and other resources. Using Terraform for your Datadog configuration enables version control, peer review processes, and automation through CI/CD pipelines. It also helps maintain consistency across different environments.

As someone who has been evangelizing observability-as-code for years, I would propose this for a greenfield setup:
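The full proposal deserves its own treatment, but as a hedged sketch of the bootstrap (the DataDog/datadog provider reads DD_API_KEY and DD_APP_KEY from the environment, so no secrets live in source control; the api_url value depends on which site your organization is homed to):

```hcl
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

# Credentials come from the DD_API_KEY / DD_APP_KEY environment variables.
provider "datadog" {
  api_url = "https://api.datadoghq.com/" # match your organization's Datadog site
}
```

From there, every monitor, dashboard, and SLO lands in version control and flows through the same peer review and CI/CD pipeline as your application code.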

Integrations (AI)

Datadog offers extensive integration capabilities with major cloud platforms and services. Some of the key integrations include AWS, Azure, and Databricks.

The AWS integration is one of the most comprehensive, allowing you to collect metrics, logs, and traces from a wide range of AWS services. It supports auto-discovery of resources, cloud security monitoring, and cost analysis. Setting up this integration typically involves creating an IAM role that Datadog can assume to collect data from your AWS account.
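A hedged sketch of the role-delegation side in Terraform (note that newer provider versions replace datadog_integration_aws with datadog_integration_aws_account; the account ID and role name here are placeholders):

```hcl
# Registers an AWS account with Datadog; Datadog assumes the named IAM
# role to collect data. The resource exports an external_id that you
# reference in the role's trust policy on the AWS side.
resource "datadog_integration_aws" "main" {
  account_id = "123456789012"              # placeholder AWS account ID
  role_name  = "DatadogAWSIntegrationRole" # IAM role Datadog will assume
}
```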

The Azure integration provides similar functionality for Microsoft's cloud platform. You can monitor Azure VMs, App Services, Functions, and many other services. The setup process involves creating a Microsoft Entra ID (formerly Azure Active Directory) app registration that grants Datadog the necessary permissions to access your Azure resources.

For data analytics workloads, the Databricks integration allows you to monitor your Spark clusters and jobs. You can track cluster performance, job execution times, and resource utilization metrics directly within Datadog.

When setting up integrations, it's important to consider the scope and permissions you're granting. Start with read-only access where possible, and carefully review which resources will be monitored to avoid unnecessary costs.

Infrastructure Hosts (AI)

Infrastructure hosts in Datadog represent the physical or virtual machines that are sending monitoring data. Each host has its own dashboard where you can view system metrics, logs, and performance information.

Proper tagging of hosts is essential for effective management. Use tags to categorize hosts by environment, team, application, or any other relevant dimension to make it easier to filter and monitor specific groups.

Logs (AI)

Datadog's log management allows you to collect, process, and analyze logs from all your applications and infrastructure. Logs can be searched, filtered, and visualized within the Datadog platform.

Efficient log management requires careful planning of log collection, retention policies, and indexing strategies. Consider using log processing pipelines to enrich logs with additional context or filter out unnecessary data before ingestion.
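As a hedged Terraform sketch of that kind of planning: a 15-day index scoped to production, with an exclusion filter that drops health-check noise before it counts against indexing costs (the queries are placeholders for your own services):

```hcl
# A 15-day index for production logs. Exclusion filters drop matching
# logs from indexing (they can still be ingested and archived).
resource "datadog_logs_index" "production" {
  name           = "production"
  retention_days = 15

  filter {
    query = "env:production"
  }

  exclusion_filter {
    name       = "drop-health-checks"
    is_enabled = true

    filter {
      query       = "@http.url_details.path:\"/healthz\"" # placeholder path
      sample_rate = 1.0                                   # exclude 100% of matches
    }
  }
}
```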

Metrics (AI)

Metrics in Datadog represent numerical data points that are collected over time. They form the foundation for monitoring and alerting on your systems' health and performance.

Custom metrics allow you to instrument your applications with business-specific measurements. However, be mindful of custom metric costs and only create metrics that provide meaningful insights for monitoring or alerting purposes.
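Metric cardinality (the number of distinct tag combinations) drives custom metric cost, and the Terraform provider lets you cap it per metric. A hedged sketch; the metric name and tag list are hypothetical:

```hcl
# Only the listed tags remain queryable on this metric, which caps the
# number of billable distinct time series it can produce.
resource "datadog_metric_tag_configuration" "checkout_latency" {
  metric_name = "checkout.request.latency" # hypothetical custom metric
  metric_type = "distribution"
  tags        = ["env", "service", "region"]

  include_percentiles = true # distribution-only option
}
```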

Monitors & Synthetics (AI)

Monitors in Datadog allow you to set up alerts based on specific conditions or thresholds being met. You can create monitors for metrics, logs, traces, and other data types.

Synthetic monitoring enables you to simulate user interactions with your applications from different geographic locations. This helps you monitor availability and performance of critical user journeys proactively.
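As a hedged sketch of a simple synthetic API test in Terraform, checking a URL from two locations every five minutes (the URL and notification handle are placeholders):

```hcl
# An HTTP uptime check that runs from two managed locations and alerts
# when the endpoint stops returning 200.
resource "datadog_synthetics_test" "homepage" {
  name      = "Homepage availability"
  type      = "api"
  subtype   = "http"
  status    = "live"
  message   = "Homepage is failing its synthetic check. @pagerduty-web" # placeholder handle
  locations = ["aws:us-east-1", "aws:eu-west-1"]

  request_definition {
    method = "GET"
    url    = "https://example.com/" # placeholder URL
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 300 # seconds between runs
  }
}
```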

Real User Monitoring (RUM)

RUM coming soon!

Service Level Objectives (SLOs) (AI)

Service Level Objectives define the acceptable level of service reliability for your applications. Datadog allows you to create and track SLOs based on monitor status, error budgets, and time windows.

SLOs help teams balance reliability with the speed of feature delivery by providing clear targets for error budgets. They also facilitate data-driven discussions about service reliability during incident reviews.
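A hedged Terraform sketch of a metric-based SLO; the numerator/denominator queries are placeholders for your own success and total counts:

```hcl
# 99.9% of requests succeed over a rolling 30 days, with a warning
# threshold that trips before the error budget is actually exhausted.
resource "datadog_service_level_objective" "api_availability" {
  name = "API availability"
  type = "metric"

  query {
    numerator   = "sum:api.requests{status:ok}.as_count()" # placeholder metric
    denominator = "sum:api.requests{*}.as_count()"         # placeholder metric
  }

  thresholds {
    timeframe = "30d"
    target    = 99.9
    warning   = 99.95
  }
}
```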

Oncall (AI)

Datadog's oncall management features help you organize and alert the right people when incidents occur. You can configure escalation policies, schedules, and notifications to ensure proper coverage.

Effective oncall practices include clear documentation of procedures, rotation schedules that avoid burnout, and tools to help responders quickly understand the context of an incident.

Incident Management (AI)

Datadog provides tools to help manage incidents from detection through resolution. You can create incident timelines, track actions taken during an incident, and document postmortems.

A structured approach to incident management helps reduce mean time to resolution (MTTR) and improves the learning process after incidents. Consider implementing standard procedures for incident declaration, communication, and post-incident reviews.

Bits AI & Bits AI SRE

Bits coming soon!

MCP Server

MCP Server coming soon!

CI Visibility

CI Visibility coming soon!

Observability Pipelines

Observability Pipelines coming soon!

Workflows

Workflows coming soon!