Table of Contents
- Introduction
- Contract
- Organizations
- Access
- Permissions
- Mobile App
- Usage Management
- Cost Management
- Self-Hosted Alternatives
- Terraform
- Integrations
- Infrastructure Hosts
- Logs
- Metrics
- Monitors & Synthetics
- Real User Monitoring (RUM)
- Service Level Objectives (SLOs)
- Oncall
- Incident Management
- Bits AI & Bits AI SRE
- MCP Server
- CI Visibility
- Observability Pipelines
- Workflows
Introduction
I started using Datadog at Peloton in 2019. Prior to that, I had used a hodgepodge of observability tools including AWS CloudWatch, Loggly, and a bunch of internal Amazon tools (e.g. Timber). I continued to use Datadog at Better and Humane, and owned the architecture for its deployment at GM. Since seeing Datadog for the first time, I have seen a dozen other tools - Observe, Coralogix, Chronosphere, Mezmo (LogDNA at the time), Grafana Labs, SigNoz, Splunk, Sentry, and a few homegrown ELK stacks. I cannot claim I have seen every tool, or even every popular tool, but I've gotten a very good breadth and a fair bit of depth in my career so far. I also cannot claim to have used every feature of Datadog, but I've definitely used the top 20 at multiple jobs. As another data point, I did pass two of the three available Datadog certifications at DASH in under 1 hour, combined.
Datadog is a cloud-hosted unified observability platform that offers tooling and integrations to collect signals or data from just about any source and index it into either a time series database or a search and analytics storage system. Datadog then provides tooling to visualize, extract automated or manual insights from, and monitor and alert on this data. Like many players in this space, you cannot locally host Datadog. The closest things you could self-host would be SigNoz or Grafana.
The obvious question is "Should I use Datadog?" and I tend to answer this question the same way I would if you asked "Should I use AWS?" In both cases, the answer is "You probably should, but it is important to evaluate your specific needs." This is because, like any PaaS or IaaS provider, Datadog works well for a large number of customers, but not every customer. I would estimate 85%-90% of companies could cost-effectively and efficiently integrate Datadog and be very pleased long-term. We will explore in this guide when you should do extra due diligence.
If you get into the business of 3rd party observability, it won't take long for someone to tell you about how expensive Datadog is relative to other solutions in the space. I want to address this right up front so there's no confusion and expectations are set properly: Datadog CAN be expensive. This is due to a few different factors which I will talk about, and not the result of any nefarious motive on the part of Datadog. Much of this guide will keep costs in mind and share suggestions for how to manage cost from day one, to avoid the kind of unwelcome surprises that leave a salty taste for the entirety of your relationship. With that in mind, let's get going!
Note: I will pitch this guide somewhere between "you have never done this" and "you have a decent idea what you're doing," with a few "expert-level" deep thoughts sprinkled in here and there. Note: I will mark each section that contains any AI-generated content, but expect that over time there will be little to no AI-generated content.
Contract
You don't need a contract to use Datadog; you can just sign up on their website. They give you a 14-day trial to test things out, and you can just pop in a credit card for any overages and incidentals. But you will pay significantly more with this model. If you have any kind of significant volume, it is wiser to sign a contract. The cost difference from the public website pricing varies by product, but for example, 15-day log retention goes from $2.55 to $1.70 per million indexed log events per month (a 33% discount) when you sign a contract. What does significant mean? If you were going to spend over $30,000 a month on Datadog without a contract, I think it's worth discussing a contract.
I've seen half a dozen Datadog contracts, which isn't comprehensive, but it is a good enough sample to provide some useful information. I will dive into more detail elsewhere if some of the terms below do not make sense.
- Volume (Almost Always) Reduces Unit Cost - there are tiers, and when your usage breaks into a higher tier, your price per unit goes down. Don't be penny-wise and pound-foolish here. Say, for example, the threshold for an additional 10% off on 15-day log indexing is 100 million logs (I don't know the exact number). Don't commit to anything between 88 million and 99 million logs; just commit to 100 million - same price, more logs. One big thing that never(?) gets discounted is log ingest - I've never seen it under $0.10 / GB.
- Drawdown Billing - above a certain commitment size, the contract will stop being product-based and move to dollar-based. This means that while you use your product usage to guide your commit, the commit goes into one big bucket, and you can use more or less of a given product as needed. The one caveat here is Audit Trail - once it's on, it's on, and turning it off does not return that money to the pool.
- Premier Support - some years ago, Datadog started baking in premier support to contracts. While you may have some negotiating room on the percentage, expect to pay for support. Based on the size of the contract, this may include a Technical Account Manager (TAM), Solution Architect (SA), and/or Account Manager (AM), but always includes instant chat support, as well as email and web support with fairly good SLAs.
- Product Bundling - some products, like Data Jobs Monitoring (DJM), APM/USM Hosts, and Infrastructure Hosts, bundle in allotments of other products, and those allotments vary based on the tier of product you select. For example, a Pro-tier Infrastructure Host includes 100 Custom Metrics for free, but an Enterprise Infrastructure Host includes 200. You have to look at your specific usage to understand what makes the most sense.
- Log Volume Measurement - log volume at Datadog is NOT measured in bytes for indexing, only for ingestion; indexing is billed per log event. Long log lines at Datadog are truncated at 4 MB uncompressed. This means that if you emit lots of tiny log lines, you're going to pay more than if you constructed fewer but more information-dense log messages. It also makes it hard to compare pricing between Datadog and other providers, most of which use bytes as the sole measure.
- 99th Percentile Billing - most costs that are billed as a monthly unit operate on 99th-percentile billing. Roughly speaking, each hour Datadog counts how many units of a particular product you have. At the end of the month, Datadog sorts those hourly counts, chops off the highest 7 values (at 720 hours in most months, 1% rounds to 7 hours), and bills for the 8th-highest value. So you can run a truckload of instances and metrics for up to 7 hours a month (load testing, anyone?) and it won't count, as long as the spike doesn't stretch into an 8th hour.
- Multiyear Discounts - always ask about multi-year discounts. Like any provider, Datadog saves money if they can renegotiate contracts every two years instead of every year, and it provides them (and you) stability in revenue (and cost).
Organizations
Datadog organizations allow you to separate users and data within your overall account. This is particularly useful for larger or divided companies that need to maintain separation between areas.
Generally, I recommend at least two organizations in an account: one as the payer organization and one child organization. This is the same model you would see in AWS, where there is a payer/root account that manages the organization and many child accounts.
You may be tempted to split your production and non-production environments into separate organizations. I do not recommend this pattern at all: it makes it much more complicated to evaluate signals across environments, and when your teams occasionally call production endpoints from non-production services, it will be harder to trace across that pathway. You can keep a separate organization for testing (especially infrastructure-as-code testing), for experiments, or when there are truly separate business units that do not intertwine at all.
Datadog runs many copies of itself across different clouds and regions. Each organization can be homed to a different cloud and region as appropriate. If you are a single-cloud company (most are), be sure to pick the cloud and region closest to your actual running cloud infrastructure. If you are primarily an on-prem company, just choose the region closest to you. Another benefit of the payer-plus-sub-organization model is that each sub-organization can be homed to whichever cloud and region suits it (as long as their data is disconnected); a small sketch of creating a child organization follows.
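If you go down the multi-organization route, child organizations can be managed as code too. A minimal sketch, assuming the Datadog Terraform provider is authenticated against the parent (payer) organization; the organization name is a placeholder:

```hcl
# Sketch: create a child organization from the payer organization.
# Note: organizations generally cannot be deleted via the API, so treat this as a one-way door.
resource "datadog_child_organization" "product" {
  name = "acme-product" # placeholder name for the business unit or region
}
```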
Access
Datadog provides four mechanisms for access to the platform:
- Username/Password - the least amount of control, but the easiest to set up: just send an invite email and assign a role. When you sign up, Datadog starts you off with a single username/password user. If you have SSO available, I recommend you set up a distribution group with a few people, like datadog@yourdomain.com, use that as the starting email, and keep the password to this account in a safe place accessible to the group members (1Password, Vault, Key Vault, etc.). I also recommend enforcing MFA and either keeping the QR code in a shared place or using a shared MFA generator.
- Google - a tight integration exists between Google sign-in and Datadog. Even though this is basically SSO using OAuth2, it is configured and managed separately from the generic SSO option below.
- Generic SSO - if you use Okta, OneLogin, Cognito, Entra, or similar, this is the best way to manage access. A few key callouts when setting this up:
- Make sure you select the correct default role
- If you have robust attribute management in your IdP, you can leverage mappings to assign users to roles and teams
- Datadog separates the concepts of login and email. Although they look the same, ensure you are using the desired IdP attribute from the get-go (changing the login generates a new user)
- If you do not have or do not want to use a central IdP homepage (where all your SSO apps are listed), Datadog will instead provide a direct link you can use to immediately log in. This is especially important if you are not using the primary US1 instance of Datadog (in AWS).
- SSO user provisioning is generally JIT (just-in-time), meaning the first time someone uses the SSO link, a user will be created if one does not already exist. This is the best way to create users.
- Override the single admin account to allow password login, and set the organization-wide default for password login to off, forcing everyone except that one admin to use SSO.
- API - Datadog has a robust API (which Terraform uses as well) to enable programmatic access to Datadog resources. Somewhat uniquely, Datadog uses two different credentials for this purpose: an API key and an Application key. In your mind, think of the API key as an Account key, used to identify the account and ingest observability signals (ingestion needs only this key). Think of the Application key as a Resource Management key, used alongside the API key to create and manage resources within Datadog programmatically. For management API calls, including Terraform, you always need both keys; a minimal provider sketch follows this list.
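To make the two-key model concrete, here is a minimal sketch of a Terraform provider configuration that uses both keys. The variable names and the example api_url are illustrative, not prescriptive:

```hcl
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

variable "datadog_api_key" {
  type      = string
  sensitive = true # identifies the account; also used on its own for ingestion
}

variable "datadog_app_key" {
  type      = string
  sensitive = true # authorizes resource management on behalf of a user or service account
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  # If your organization is not homed to US1, point at the correct site, e.g.:
  # api_url = "https://api.datadoghq.eu/"
}
```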
Permissions
There are four different permission models in Datadog:
- Role Access - users are assigned roles (without one, they have no permissions), and roles have permissions to an entire class of entity (e.g. Read Users, Write Users, Read Dashboards, Write Dashboards). No resource level permissions can be assigned here.
- Restriction Policies - many, and a growing list, of resources can have restriction policies applied. These granular policies can target specific permissions to very specific groups (offering a resource-specific deny policy). There are precursors for this called "restricted_roles" on some resources (like dashboards and monitors) that should no longer be used.
- Logs Data Access Restriction Queries - logs can be restricted to only be visible to, and their processing rules restricted to, certain roles, with a restriction query scoping exactly which log data each role can read
- Organization Data Access Controls - logs, APM, and RUM data can be restricted to only be visible to, and their processing rules restricted to, certain roles and teams
These models are applied to a few different principals:
- User - all users in Datadog have one or more Roles (and inherit the additive permissions of those roles). There is no Deny ability in Datadog roles, but you can manage specific resource permissions (viewer and editor are the most common) using Restriction Policies, granting individual users access to specific resources even when many users have broad permissions through their role
- Role - the most common permission target, a role is configured with a set of permissions to specific functions (not specific resources) and any user assigned this role will have all of the attached permissions
- Team - a user can belong to one or more teams, and restriction policies can target these teams. A common use case might be that all users can read all dashboards, but team members can edit the dashboards belonging to the team.
- Organization - if your permission strategy has given all users write access to dashboards, then to restrict editing of a particular dashboard you must first declare that all organization members are only viewers of that dashboard, and then declare the editor permission for the specific role(s), team(s), or user(s) (see the sketch after this list)
Beyond the models and principals, a few principles and caveats to keep in mind:
- Least Privilege - start with the minimum number of permissions required for your business. There is no need to allow read access to features that you are not going to use, or that would incur costs you are not willing to pay for if used. Users will be less tempted to start shipping data they will never see (although you should still be vigilant about checking Cost and Usage).
- Lockout Prevention - Datadog has an optional mechanism (enabled by default) in place to prevent locking out Admins. If you are sure you know what you're doing, you can allow Admins to lock themselves out of certain resource access.
- Limited Options - Datadog only has the concepts of read and edit/manage for most resources (with a rarely-used "run" permission a distant third); edit/manage encompasses the create, edit, and delete functions
- Built-in Profile Edit - Every user can edit their profile, which includes their name and email address. You cannot stop this. This can potentially break observability-as-code functionality if you look up users by email. Tell people not to touch.
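To illustrate the dashboard example above, here is a hedged sketch using the datadog_restriction_policy resource from the Terraform provider. The dashboard ID, org UUID, and team UUID are placeholders you would replace with real values (or references to other Terraform resources):

```hcl
# Sketch: everyone in the org can view a dashboard, but only one team can edit it.
# All IDs below are placeholders.
resource "datadog_restriction_policy" "checkout_dashboard" {
  # Restriction policies target resources as "<type>:<id>".
  resource_id = "dashboard:abc-123-def"

  # Everyone in the organization is a viewer.
  bindings {
    relation   = "viewer"
    principals = ["org:00000000-0000-0000-0000-000000000000"] # placeholder org UUID
  }

  # Only the owning team can edit.
  bindings {
    relation   = "editor"
    principals = ["team:11111111-1111-1111-1111-111111111111"] # placeholder team UUID
  }
}
```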
Mobile App
Datadog has an app in the App Store (iOS) and on the Google Play Store (Android). I highly recommend you download it, for several reasons:
- Paging will work a lot better, allowing app notifications as well as configuring overrides for calls and texts
- Instant access to all the common Datadog resources (dashboards, metrics, logs, etc) right on your phone or tablet
- Deep integrations work seamlessly while on your mobile device without complex authentication flows
- If you're using Bits, it will work directly from your phone, which is awesome
Usage Management (AI)
Datadog provides several tools to help you monitor and manage your usage of various products. In the Cost & Usage section of Datadog, you can view detailed breakdowns of your consumption for logs, metrics, monitors, and other services. Datadog also provides a few out-of-the-box dashboards that help with this as well:
- Log Management
- APM Trace Estimated Usage
- Datadog Cost Overview and Changes
Setting up usage alerts is an absolute necessity if you want to avoid unexpected charges. Using the estimated usage metrics that power these dashboards, the Monitors product can alert you to anomalous usage, or to gradual growth that breaches predefined thresholds (a sketch follows). You can also set up cost alerts, which are discussed more in the next section.
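Here is a hedged sketch of such a usage alert, built on one of the datadog.estimated_usage.* metrics via the Terraform provider. The exact metric name, threshold, and notification handle are assumptions to adapt to your own budget:

```hcl
# Sketch: alert when estimated log ingestion is trending above an agreed budget.
# Metric name, threshold, and the @slack handle are placeholders.
resource "datadog_monitor" "log_ingest_budget" {
  name    = "Log ingestion approaching budget"
  type    = "metric alert"
  message = "Estimated log ingestion is above the agreed budget. @slack-observability"

  query = "sum(last_4h):sum:datadog.estimated_usage.logs.ingested_events{*}.as_count() > 50000000"

  monitor_thresholds {
    critical = 50000000
    warning  = 40000000
  }

  tags = ["team:observability", "managed-by:terraform"]
}
```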
Cost Management (AI)
Datadog offers several tools to help you understand and manage your costs effectively. The Usage section provides granular breakdowns of your consumption across different products, allowing you to see exactly where your costs are coming from.
For a more comprehensive view, the Cloud Cost Management tool integrates with your cloud provider billing data to show you not just Datadog costs, but also the underlying infrastructure costs associated with your monitored resources. This holistic view helps you understand the total cost of ownership for your observability and monitoring setup.
To optimize costs, consider implementing cost allocation tags to track spending by team or project. Set up budget alerts in both the Usage section and Cloud Cost Management to receive notifications when your spending approaches predefined thresholds. Regularly review your monitor usage, log retention policies, and custom metric creation to identify areas where you can reduce costs without impacting observability.
Self-Hosted Alternatives
Self-hosted alternatives coming soon!
Terraform (AI)
Datadog provides a Terraform provider that allows you to manage your Datadog configuration as code. This includes monitors, dashboards, service level objectives (SLOs), and other resources. Using Terraform for your Datadog configuration enables version control, peer review processes, and automation through CI/CD pipelines. It also helps maintain consistency across different environments.
As someone who has been evangelizing observability-as-code for years, I would propose this for a greenfield setup:
- The first admin creates:
- an API key named Terraform
- a service account named Terraform (with a distribution email of terraform@yourdomain.com) assigned the Admin role; you can filter most of that mailbox to trash, and the address can be an alias of the admin's. If you really want, you could craft a new role called Terraform by hand (chicken-and-egg problem) that has the specific permissions you want.
- an application key named Terraform owned by the service account. You might think to scope this key, but scoped keys can introduce issues in certain obscure cases.
Terraform can now be used to do the following (a sketch of this bootstrap follows the list):
- Import the role and service account you created earlier (API key and application key imports will probably be deprecated by the time you read this)
- Create a read role with all the read permissions you want to the products and services you will utilize in Datadog
- Create one or more roles with the specific edit permissions you want
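Here is a hedged sketch of that setup once the hand-created pieces exist. The permission names, role contents, and service account details are illustrative, and you would terraform import the hand-created service account before applying:

```hcl
# Sketch: after bootstrapping by hand, manage roles and the service account in Terraform.
data "datadog_permissions" "all" {}

# A broad read role; trim the permission list to the products you actually use.
resource "datadog_role" "reader" {
  name = "Reader"

  permission {
    id = data.datadog_permissions.all.permissions["dashboards_read"]
  }
  permission {
    id = data.datadog_permissions.all.permissions["monitors_read"]
  }
  # ...add the other read permissions you care about
}

# The Terraform service account created by the first admin, imported and now codified.
resource "datadog_service_account" "terraform" {
  name  = "Terraform"
  email = "terraform@yourdomain.com"
  roles = [datadog_role.reader.id] # in practice: your Admin or hand-crafted Terraform role
}
```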
Integrations (AI)
Datadog offers extensive integration capabilities with major cloud platforms and services. Some of the key integrations include AWS, Azure, and Databricks.
The AWS integration is one of the most comprehensive, allowing you to collect metrics, logs, and traces from over 200 AWS services. It supports auto-discovery of resources, cloud security monitoring, and cost analysis. Setting up this integration typically involves creating an IAM role that Datadog can assume to collect data from your AWS account.
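As a hedged sketch of registering that role with Datadog via Terraform (the account ID, role name, and tags are placeholders; newer provider versions also offer a datadog_integration_aws_account resource that may supersede this one):

```hcl
# Sketch: register an AWS account with Datadog via the cross-account IAM role.
# The IAM role itself (trusting Datadog's AWS account with an External ID) must be
# created on the AWS side, e.g. with the aws provider or CloudFormation.
resource "datadog_integration_aws" "prod" {
  account_id = "123456789012"           # placeholder AWS account ID
  role_name  = "DatadogIntegrationRole" # placeholder IAM role name

  # Tags applied to every host pulled in from this account.
  host_tags = ["env:prod", "account:prod"]
}
```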
The Azure integration provides similar functionality for Microsoft's cloud platform. You can monitor Azure VMs, App Services, Functions, and many other services. The setup process involves creating an Active Directory application that grants Datadog the necessary permissions to access your Azure resources.
For data analytics workloads, the Databricks integration allows you to monitor your Spark clusters and jobs. You can track cluster performance, job execution times, and resource utilization metrics directly within Datadog.
When setting up integrations, it's important to consider the scope and permissions you're granting. Start with read-only access where possible, and carefully review which resources will be monitored to avoid unnecessary costs.
Infrastructure Hosts (AI)
Infrastructure hosts in Datadog represent the physical or virtual machines that are sending monitoring data. Each host has its own dashboard where you can view system metrics, logs, and performance information.
Proper tagging of hosts is essential for effective management. Use tags to categorize hosts by environment, team, application, or any other relevant dimension to make it easier to filter and monitor specific groups.
Logs (AI)
Datadog's log management allows you to collect, process, and analyze logs from all your applications and infrastructure. Logs can be searched, filtered, and visualized within the Datadog platform.
Efficient log management requires careful planning of log collection, retention policies, and indexing strategies. Consider using log processing pipelines to enrich logs with additional context or filter out unnecessary data before ingestion.
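To make retention and filtering concrete, here is a hedged sketch of a log index with 15-day retention and an exclusion filter, using the Terraform provider. The index name, query, and sample rate are assumptions:

```hcl
# Sketch: a 15-day index that keeps everything except most debug logs.
resource "datadog_logs_index" "main" {
  name           = "main"
  retention_days = 15

  # Everything not matched by a more specific index lands here.
  filter {
    query = "*"
  }

  # Exclude 95% of debug logs at indexing time; excluded logs are still ingested
  # (and still available to archives and Live Tail), just not indexed.
  exclusion_filter {
    name       = "sample-debug"
    is_enabled = true
    filter {
      query       = "status:debug"
      sample_rate = 0.95
    }
  }
}
```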
Metrics (AI)
Metrics in Datadog represent numerical data points that are collected over time. They form the foundation for monitoring and alerting on your systems' health and performance.
Custom metrics allow you to instrument your applications with business-specific measurements. However, be mindful of custom metric costs and only create metrics that provide meaningful insights for monitoring or alerting purposes.
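One concrete lever for custom metric cost is controlling which tag combinations a metric keeps, since tag cardinality is what drives custom metric counts. A hedged sketch using datadog_metric_tag_configuration (the metric name and tag set are assumptions):

```hcl
# Sketch: only keep the tags that matter for this distribution metric,
# which caps its cardinality (and therefore its custom metric count).
resource "datadog_metric_tag_configuration" "checkout_latency" {
  metric_name = "checkout.payment.latency" # placeholder metric
  metric_type = "distribution"
  tags        = ["env", "service", "region"]

  # Percentile aggregations cost extra on distributions; enable them deliberately.
  include_percentiles = false
}
```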
Monitors & Synthetics (AI)
Monitors in Datadog allow you to set up alerts based on specific conditions or thresholds being met. You can create monitors for metrics, logs, traces, and other data types.
Synthetic monitoring enables you to simulate user interactions with your applications from different geographic locations. This helps you monitor availability and performance of critical user journeys proactively.
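Two hedged sketches follow, one metric monitor and one HTTP synthetic test, using the Terraform provider. The metric, thresholds, URL, locations, and notification handles are all placeholders:

```hcl
# Sketch: alert when average checkout latency is elevated for 10 minutes.
resource "datadog_monitor" "checkout_latency" {
  name    = "Checkout latency is elevated"
  type    = "metric alert"
  message = "Average checkout latency is over threshold. @pagerduty-checkout"
  query   = "avg(last_10m):avg:checkout.payment.latency{env:prod} > 2"

  monitor_thresholds {
    critical = 2
    warning  = 1.5
  }

  tags = ["team:checkout", "managed-by:terraform"]
}

# Sketch: a simple HTTP uptime check from two locations every 5 minutes.
resource "datadog_synthetics_test" "homepage" {
  name      = "Homepage is reachable"
  type      = "api"
  subtype   = "http"
  status    = "live"
  message   = "Homepage check failing. @slack-observability"
  locations = ["aws:us-east-1", "aws:eu-west-1"]
  tags      = ["team:web", "managed-by:terraform"]

  request_definition {
    method = "GET"
    url    = "https://www.example.com/"
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 300
  }
}
```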
Real User Monitoring (RUM)
RUM coming soon!
Service Level Objectives (SLOs) (AI)
Service Level Objectives define the acceptable level of reliability for your applications. Datadog allows you to create and track SLOs based on monitor status or metric queries, with targets evaluated over time windows and tracked against an error budget.
SLOs help teams balance reliability with the speed of feature delivery by providing clear targets for error budgets. They also facilitate data-driven discussions about service reliability during incident reviews.
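A hedged sketch of a metric-based SLO via the Terraform provider; the numerator and denominator queries, target, and tags are assumptions:

```hcl
# Sketch: 99.9% of checkout requests succeed over a rolling 30 days.
resource "datadog_service_level_objective" "checkout_availability" {
  name = "Checkout availability"
  type = "metric"

  query {
    numerator   = "sum:checkout.requests{env:prod,status:success}.as_count()"
    denominator = "sum:checkout.requests{env:prod}.as_count()"
  }

  thresholds {
    timeframe = "30d"
    target    = 99.9
    warning   = 99.95
  }

  tags = ["team:checkout", "managed-by:terraform"]
}
```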
Oncall (AI)
Datadog's oncall management features help you organize and alert the right people when incidents occur. You can configure escalation policies, schedules, and notifications to ensure proper coverage.
Effective oncall practices include clear documentation of procedures, rotation schedules that avoid burnout, and tools to help responders quickly understand the context of an incident.
Incident Management (AI)
Datadog provides tools to help manage incidents from detection through resolution. You can create incident timelines, track actions taken during an incident, and document postmortems.
A structured approach to incident management helps reduce mean time to resolution (MTTR) and improves the learning process after incidents. Consider implementing standard procedures for incident declaration, communication, and post-incident reviews.
Bits AI & Bits AI SRE
Bits coming soon!
MCP Server
MCP Server coming soon!
CI Visibility
CI Visibility coming soon!
Observability Pipelines
Observability Pipelines coming soon!
Workflows
Workflows coming soon!