Agent Skill · dynatrace

dt-alerting

End-to-end Dynatrace alerting lifecycle — anomaly detector setup and model selection (static threshold, adaptive baseline, seasonal baseline), alert event storage in Grail, problem grouping and denoising by root cause analysis, and workflow-based notification routing (email, Slack, ServiceNow, webhook). Use when configuring alerting, choosing between detector types, querying alert event history, understanding why alerts merged into a problem, or setting up problem-triggered notifications.

Provider: dynatrace Path in repo: skills/dt-alerting/SKILL.md

Skill body

dt-alerting

Configure and understand the full alerting lifecycle in Dynatrace — from anomaly detector setup through Grail event storage, problem grouping, and workflow notification delivery.

The Alerting Lifecycle

┌─────────────────────────────────────────────────────────────────────┐
│  Alert Sources — five categories, each fires a DAVIS_EVENT          │
│  ─────────────────────────────────────────────────────────────────  │
│  1. DQL-based  · Grail-scheduled server-side detector               │
│  2. Edge       · OneAgent on monitored host or process              │
│  3. Pipeline   · OpenPipeline ingest-stream filter matcher          │
│  4. Synthetic  · Worldwide synthetic checker node                   │
│  5. External   · Events API, Workflow, or OneAgent local ingest     │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ DAVIS_EVENT created per trigger per entity
                               ▼
             ┌─────────────────────────────────────┐
             │  Event stored in Grail              │  Persisted and queryable via DQL.
             └──────────────────┬──────────────────┘  One event per trigger per entity.
                                │ correlated by root-cause and impact graph
                                ▼
             ┌─────────────────────────────────────┐
             │  Problem (Denoising)                │  Events sharing the same root-cause
             └──────────────────┬──────────────────┘  and impact graph → one Problem.
                                │ problem event triggers workflow
                                ▼
             ┌─────────────────────────────────────┐
             │  Workflow Notification              │  Filters problems and routes to
             └─────────────────────────────────────┘  email, Slack, ServiceNow, webhook.

When to Use This Skill

Agent Instructions

First step for any alerting setup request — Before recommending a specific detector or model, load references/anomaly-detectors.md and use its category and model decision guide to identify which detector category (DQL-based, Edge, Pipeline, Synthetic, External) and which model (Static, Adaptive, Seasonal) best fits the user’s use-case. Only proceed with configuration guidance once the right detector type has been established.

Consolidate, don’t multiply — When a user asks to alert on multiple entities of the same kind (e.g. “alert on services A, B, and C”), always recommend a single combined detector rather than one detector per entity. Use by: { <dimension> } in the DQL timeseries call to split results per entity, and a single filter: clause to scope to the relevant entities. Pair the combined detector with a single dt.alert_group tag shared across all alert conditions and the corresponding workflow notification filter. This keeps the number of detector configs small, ensures consistent routing, and makes the workflow notification channel reusable for future entities added to the same group.

Example for three services — one detector, one workflow:

timeseries avg(dt.service.request.response_time),
  by: { dt.smartscape.service },
  filter: { in(dt.smartscape.service, {toSmartscapeId("SERVICE-0000000000000001"), toSmartscapeId("SERVICE-0000000000000002"), toSmartscapeId("SERVICE-0000000000000003")}) }

Set dt.alert_group: "checkout-team" in the detector’s event properties, then filter the notification workflow on matchesPhrase(dt.alert_group, "checkout-team"). If a new service must be covered, add it to the single filter: list — no new detector or workflow rule needed.

Intent Mapping

User Request Action Reference
“how to alert on …”, “create an alert on …”, “create anomaly detector”, “set up alerting”, “configure alert rule” Explain detector categories and variants, guide through model selection anomaly-detectors.md
“what kinds of anomaly detectors”, “edge alert”, “pipeline alert”, “synthetic alert”, “OneAgent alert” Explain the five alert source categories and their trade-offs anomaly-detectors.md
“static vs adaptive”, “which detector model”, “seasonal detector” Compare models, apply decision guide anomaly-detectors.md
“query alert history”, “which alerts fired”, “Davis events in Grail” Query dt.davis.events in Grail via fetch dt.davis.events davis-events.md
“why did alerts merge”, “problem grouping”, “denoising” Do NOT explain merging rules here — load dt-obs-problems and refer to problem-merging.md for the full merge logic dt-obs-problems/references/problem-merging.md
“send Slack notification”, “email on problem”, “ServiceNow ticket”, “notify on alert” Explain problem-triggered workflow setup workflow-notifications.md
“alert storm”, “too many notifications”, “reduce noise” Filtering strategy, denoising, sensitivity tuning workflow-notifications.md + anomaly-detectors.md

Analyzing existing problems — If the user wants to query or investigate active/closed problems (root cause, impact, trending), load dt-obs-problems instead. This skill covers configuration and flow, not problem query analytics.

Detector health monitoring — If the user asks whether detectors are running or failing, load dt-platform (ANALYZER_EXECUTION_EVENT, ANOMALY_DETECTOR_STATUS_EVENT). This skill covers setup, not operational health.

Prerequisites

Knowledge Base Structure

# Reference Content
1 anomaly-detectors.md Detector types, model selection, configuration, best practices
2 davis-events.md Davis event storage in Grail, key fields, DQL query patterns
3 workflow-notifications.md Problem-triggered workflows, filtering, notification channels

Key Concepts

Alert Source Categories

Five fundamental categories of anomaly detectors, distinguished by where detection runs and how the alert event reaches Dynatrace:

# Category Detection runs on Latency Alert logic owner
1 DQL-based Grail (server-side, scheduled) Minutes Dynatrace
2 Edge OneAgent on the monitored host/process Seconds Dynatrace (OneAgent)
3 Pipeline OpenPipeline ingest path (in-stream) Near-zero Dynatrace (pipeline rule)
4 Synthetic Synthetic checker node (worldwide) Seconds Dynatrace (synthetic node)
5 External Customer / external tool Caller-defined Customer

See references/anomaly-detectors.md for the full breakdown of each category, including trade-offs and configuration entry points.

Detector Models at a Glance

Model Threshold Best for
Static Fixed value you define Known hard limits (e.g. error rate > 5%)
Adaptive baseline Learned from recent history Metrics with no fixed limit but clear normal behavior
Seasonal baseline Learned with time-of-day / day-of-week awareness Traffic, request rate, or any metric with recurring patterns

Davis Events vs. Problems

Concept Table Scope
Davis event fetch dt.davis.events One record per detector trigger per entity
Problem fetch dt.davis.problems One record per correlated group of events sharing root-cause and impact

A single problem typically contains multiple events. Querying problems gives the operational view; querying events gives the raw alert history.

Problem Denoising

For questions about why alerts merged into a problem or how Davis groups events, load dt-obs-problems — the merge logic and rules are documented in dt-obs-problems/references/problem-merging.md. This skill covers alert configuration and flow only.

Quick Start

Check What Alerts Fired in the Last 24 Hours

fetch dt.davis.events, from: -24h
| filter event.status == "ACTIVE"
| summarize alert_count = count(), by: {event.name, event.category, dt.smartscape_source.id}
| sort alert_count desc
| limit 20

Check Alert Volume by Category

fetch dt.davis.events, from: -24h
| summarize count = count(), by: {event.category, event.status}
| sort count desc

See All Active Problems (→ load dt-obs-problems for full query patterns)

fetch dt.davis.problems, from: -24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20

Best Practices

  1. Match the model to the metric’s behavior — Use static for hard SLO boundaries, adaptive for metrics without a natural fixed limit, seasonal for anything that follows business hours or weekly patterns.
  2. Scope detectors narrowly — An entity selector that covers only relevant entities reduces noise and makes problems more actionable.
  3. Tune sensitivity before going to production — Start with LOW sensitivity and move to MEDIUM or HIGH only after observing false-positive rates.
  4. Let Davis denoise before notifying — Trigger workflow notifications on problems, not individual alert events. A problem groups correlated alerts so you notify once per incident, not once per metric.
  5. Filter notifications by severity level — Route event.severity <= 2 problems to on-call channels immediately; route event.severity >= 3 problems to lower- urgency channels. Either set severity in the detector config or assign in a pipeline rule or workflow.
  6. Use dt.alert_group event property for routing — Assign dt.alert_group to route alerts to the right team. Either set a static value in the detector config, use dynamic assignment through DQL query result mapping or assign in a pipeline rule.
  7. Combine same-condition alerts into one detector and one workflow — When alerting on multiple entities with the same metric and threshold, merge them into a single DQL-based detector using by: { <dimension> } and a combined filter: clause. Assign the same dt.alert_group value to every condition in that detector and point the workflow notification channel at that single group. One detector + one workflow per logical alert group scales better than N detectors + N notification rules, and adding a new entity is a one-line filter change rather than a full detector/workflow addition.

Skill frontmatter

license: Apache-2.0