Agent Skill · dynatrace

dt-obs-problems

DAVIS problem analysis including root cause identification, impact assessment, and correlation with other telemetry. Use when querying or investigating detected problems. Trigger: "active problems", "root cause analysis", "problem impact", "affected users", "list problems", "P-12345 details", "recurring problems", "problem history", "problem trending", "blast radius", "which entity caused the problem", "problems affecting Kubernetes", "problems by service". Do NOT use for explaining existing queries, product documentation questions, generic log searching, distributed tracing, or host-level resource monitoring.

Provider: dynatrace Path in repo: skills/dt-obs-problems/SKILL.md

Skill body

Problem Analysis Skill

Analyze Dynatrace AI-detected problems including root cause identification, impact assessment, and correlation with logs and metrics.


Use Cases

1. Active Problem Triage

2. Root Cause Investigation


Overview

Dynatrace automatically detects anomalies, performance degradations, and failures across your environment, creating problems that aggregate related alert, warning and info-level events and provide root cause and impact insights.

What are Problems?

Problems are automatically detected software and infrastructure health and resilience issues that:

Event Kinds

The event.kind field (stable, permission) identifies the high-level event type:

| event.kind value | Description |
| --- | --- |
| DAVIS_EVENT | Davis-detected infrastructure/application events |
| BIZ_EVENT | Business events (ingested via API or captured from spans) |
| RUM_EVENT | Real User Monitoring events |
| AUDIT_EVENT | Administrative/security audit events |

event.provider (stable, permission) identifies the event source.

Problem Categories

Common event.category values:

| Category | Description | Example |
| --- | --- | --- |
| AVAILABILITY | Infrastructure or service unavailable | Web service returns no data, synthetic test actively fails, database connection lost |
| ERROR | Increased error rates beyond baseline | API error rate jumped from 0.1% to 15% |
| SLOWDOWN | Performance degradation | Response time increased from 200ms to 5000ms |
| RESOURCE | Resource saturation | Container memory at 95%, causing OOM kills |
| CUSTOM | Custom anomaly detections | Business KPI (orders/minute) dropped below threshold |
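
As a minimal sketch of how these categories are typically used, the query below narrows the problem list to two of the categories from the table. The 24h window and the choice of categories are illustrative assumptions; the field names are the ones used elsewhere in this skill.

```dql
// Illustrative: availability and error-rate problems from the last 24 hours
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter event.category == "AVAILABILITY" or event.category == "ERROR"
| fields event.start, display_id, event.name, event.category, event.status
| sort event.start desc
| limit 20
```

Swap in any other category value from the table to focus on a different problem type.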

Problem Lifecycle

Detection → ACTIVE → Under Investigation → CLOSED
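
One way to see where problems currently sit in this lifecycle is to count them by event.status. This is a sketch under the assumption that only the two status values documented in this skill (ACTIVE and CLOSED) appear in the data; the 24h window is arbitrary.

```dql
// Count deduplicated problems per lifecycle status
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| summarize problem_count = count(), by: {event.status}
| sort problem_count desc
```

Note that "Under Investigation" in the lifecycle above describes a phase of work, not a value of event.status.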

Essential Fields

Common Field Name Mistakes

| ❌ WRONG | ✅ CORRECT | Description |
| --- | --- | --- |
| title | event.name | Problem title/description |
| status | event.status | Problem lifecycle status |
| severity | event.category | Problem type/category |
| start | event.start | Problem start time |
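
The mapping above can be illustrated with a before/after pair. This is only a sketch: the commented-out query uses the names the table marks as wrong, and the 2h window and limit are arbitrary choices.

```dql
// ❌ WRONG: field names the table above lists as mistakes
// fetch dt.davis.problems, from:now() - 2h
// | fields start, title, status, severity

// ✅ CORRECT: the same selection with the documented field names
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate)
| fields event.start, event.name, event.status, event.category
| limit 10
```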

Correct Status Values

// ✅ CORRECT: Use these status values
fetch dt.davis.problems, from:now() - 2h
| filter event.status == "ACTIVE"   // Currently occurring problems
//     or event.status == "CLOSED"  // Resolved problems
// ❌ INCORRECT: event.status == "OPEN" does not exist!
| limit 1

Key Fields Reference

fetch dt.davis.problems, from:now() - 1h
| filter not(dt.davis.is_duplicate)
| fields
    event.start,                          // Problem start timestamp
    event.end,                            // Problem end timestamp (if closed)
    display_id,                           // Human-readable problem ID (P-XXXXX)
    event.name,                           // Problem title
    event.description,                    // Detailed description
    event.category,                       // Problem type
    event.status,                         // ACTIVE or CLOSED
    dt.smartscape_source.id,              // The smartscape ID for the affected resource
    dt.davis.affected_users_count,        // Number of affected users
    smartscape.affected_entity.ids,       // Array of affected entity IDs
    dt.smartscape.service,                // Affected services (may be array)
    dt.davis.root_cause_entity,           // Entity identified as root cause
    root_cause_entity_id,                 // Root cause entity ID
    root_cause_entity_name,               // Human-readable root cause name
    dt.davis.is_duplicate,                // Whether this record is a duplicate detection
    dt.davis.is_rootcause                 // Root cause vs. symptom
| limit 10

Standard Query Pattern

Always start problem queries with this foundation:

fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20

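Key components:

  1. Time range: from:now() - 2h bounds how much data the query scans
  2. Deduplication: not(dt.davis.is_duplicate) keeps each problem from being counted more than once
  3. Status filter: event.status == "ACTIVE" restricts results to currently occurring problems
  4. Field selection: display_id, event.name, and event.category identify and classify each problem
  5. Ordering and limit: sort event.start desc with limit 20 returns the most recent problems first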

Common Query Patterns

Active Problems by Category

fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| summarize problem_count = count(), by: {event.category}
| sort problem_count desc

High-Impact Active Problems (affecting many users)

fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter dt.davis.affected_users_count > 100
| fields event.start, display_id, event.name, dt.davis.affected_users_count, event.category
| sort dt.davis.affected_users_count desc

High-Impact Active Problems (affecting many smartscape entities)

fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter arraySize(affected_entity_ids) > 5
| fields event.start, display_id, event.name, affected_entity_ids, event.category, impacted_entity_count = arraySize(affected_entity_ids)
| sort impacted_entity_count desc

Specific Problem Details

fetch dt.davis.problems, from:now() - 7d   // widen the range if the problem is older
| filter display_id == "P-XXXXXXXXXX"
| fields event.start, event.end, event.name, event.description, affected_entity_ids, dt.davis.affected_users_count, root_cause_entity_id, root_cause_entity_name

Service-Specific Problem History

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter in(dt.smartscape.service, toSmartscapeId("SERVICE-XXXXXXXXX"))
| summarize problems = count(), by: {event.category, event.status}

Root Cause Analysis Patterns

Basic Root Cause Query

fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields
    display_id,
    event.name,
    event.description,
    root_cause_entity_id,
    root_cause_entity_name,
    smartscape.affected_entity.ids

Root Cause by Entity Type

Identify which root cause entities most frequently cause problems (the query groups by entity name rather than by type):

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize problem_count = count(), by:{root_cause_entity_name}
| sort problem_count desc
| limit 20

Affected entity is an AWS resource

fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(arrayToString(smartscape.affected_entity.types, delimiter:","), "AWS_")

Infrastructure Root Cause with Service Impact

fetch dt.davis.problems, from:now() - 30m
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(root_cause_entity_id, "HOST-")
| filter isNotNull(dt.smartscape.service)
| fields display_id, event.name, root_cause_entity_name, dt.smartscape.service

Problem Blast Radius

Calculate entity impact per root cause:

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| fieldsAdd affected_count = arraySize(smartscape.affected_entity.ids)
| summarize
    avg_affected = avg(affected_count),
    max_affected = max(affected_count),
    problem_count = count(),
    by:{root_cause_entity_name}
| sort avg_affected desc

Recurring Root Causes

Identify entities repeatedly causing problems:

fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize
    problem_count = count(),
    first_occurrence = min(event.start),
    last_occurrence = max(event.start),
    by:{root_cause_entity_id, root_cause_entity_name}
| filter problem_count > 3
| sort problem_count desc

Cause Category vs. Root Cause Entity

These are different questions — pick the right approach:

Cause category breakdown (use when asked about common causes, patterns, or types):

fetch dt.davis.problems, from:now() - 30d
| filter not(dt.davis.is_duplicate)
| summarize problem_count = count(), by: {event.category}
| sort problem_count desc

Then for each category, explain what triggers it using the Problem Categories table and cite specific entities from the tenant data as examples.
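
For the other half of the comparison, root cause entities (use when asked which specific entity is responsible), a minimal sketch is shown below. It assumes the same 30d window as the category breakdown and reuses the root cause fields from the Key Fields Reference; the limit is arbitrary.

```dql
// Root cause entities per category, most frequent first
fetch dt.davis.problems, from:now() - 30d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize problem_count = count(), by: {event.category, root_cause_entity_name}
| sort problem_count desc
| limit 20
```

The output can also supply the entity examples to cite alongside each category.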

Track problem trends over time, identify recurring issues, and analyze resolution performance.

Primary Files:

Common Use Cases:

Key Techniques:

See references/problem-trending.md for complete query patterns and best practices.
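
Until that reference is loaded, a daily trend can be sketched as follows. This is an illustrative query, not one taken from problem-trending.md; it assumes that bucketing event.start with bin() into 1d intervals is an acceptable granularity for a 7d window.

```dql
// Problems per day and category over the last week
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| fieldsAdd day = bin(event.start, 1d)
| summarize problem_count = count(), by: {day, event.category}
| sort day asc
```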

Cross-Domain Problem Queries

Problems Associated with Kubernetes Clusters

Use affected_entity_ids or dt.smartscape_source.id to find problems related to Kubernetes:

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter matchesPhrase(dt.smartscape_source.id, "KUBERNETES_CLUSTER")
    or matchesPhrase(dt.smartscape_source.id, "K8S_")
| fields event.start, display_id, event.name, event.category, event.status,
    dt.smartscape_source.id, affected_entity_ids
| sort event.start desc

Alternative: expand affected entities and filter for K8s entity types:

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| expand entity_id = affected_entity_ids
| filter matchesPhrase(entity_id, "KUBERNETES_CLUSTER")
    or matchesPhrase(entity_id, "K8S_")
| fields event.start, display_id, event.name, event.category, entity_id
| sort event.start desc

Simple Problem Listing

List all problems from the last 24 hours (common request):

fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| fields event.start, event.end, display_id, event.name, event.category, event.status
| sort event.start desc

Response Construction

Problem Cause Summaries

When summarizing problem causes, categories, or patterns, provide a comprehensive breakdown across all standard categories present in the data: AVAILABILITY, ERROR, SLOWDOWN, RESOURCE, and CUSTOM. For each category found:

  1. Category name and count of problems
  2. What triggers it — brief explanation (e.g., RESOURCE = CPU/memory/disk threshold exceeded; AVAILABILITY = service or entity became unreachable)
  3. Specific examples from the tenant’s data (affected entity names, problem IDs)

Do not stop after the first two categories — users expect the full picture. Reference the Problem Categories table above for trigger descriptions.
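
A query along the following lines could supply the counts and examples for such a summary. It is a sketch under stated assumptions: takeAny() is used to pick one arbitrary non-null example per category, so the example problem and the example root cause are not guaranteed to come from the same record, and the 7d window is an arbitrary choice.

```dql
// One row per category: count plus an arbitrary example to cite
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| summarize
    problem_count = count(),
    example_problem = takeAny(display_id),
    example_root_cause = takeAny(root_cause_entity_name),
    by: {event.category}
| sort problem_count desc
```

Categories with no problems in the window produce no row, so state explicitly in the response that they were absent.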

Analysis Results

When presenting query results:

Best Practices

Essential Rules

  1. Always filter duplicates: Use not(dt.davis.is_duplicate) to avoid counting the same problem multiple times
  2. Use correct status values: "ACTIVE" or "CLOSED", never "OPEN"
  3. Specify time ranges: Always include time bounds to optimize performance
  4. Include display_id: Essential for problem identification and linking
  5. Test incrementally: Add one filter or field at a time when building queries
  6. Filter early: Apply not(dt.davis.is_duplicate) immediately after fetch

Query Development

Root Cause Verification

Time Range Guidelines

// ✅ GOOD - Specific time range
fetch dt.davis.problems, from:now() - 4h
// ❌ BAD - Scans all historical data
fetch dt.davis.problems

Absolute Timeframes Require Double Quotes

When using absolute ISO 8601 timestamps for from and to in DQL queries, always wrap them in double quotes. Unquoted timestamps are a syntax error.

// ✅ CORRECT - absolute timestamps quoted
fetch dt.davis.problems, from: "2026-05-18T22:50:00Z", to: "2026-05-18T23:35:00Z"
| filter not(dt.davis.is_duplicate)
| fields event.start, display_id, event.name, event.category, event.status
| sort event.start desc
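
For contrast, the sketch below places the unquoted form next to a quoted one, this time summarizing by category for the same window. The timestamps are the ones from the example above.

```dql
// ❌ INCORRECT - unquoted absolute timestamps
// fetch dt.davis.problems, from: 2026-05-18T22:50:00Z, to: 2026-05-18T23:35:00Z

// ✅ CORRECT - the same window with quoted timestamps
fetch dt.davis.problems, from: "2026-05-18T22:50:00Z", to: "2026-05-18T23:35:00Z"
| filter not(dt.davis.is_duplicate)
| summarize problem_count = count(), by: {event.category}
| sort problem_count desc
```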

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| No problems returned | Using event.status == "OPEN" | Use "ACTIVE" or "CLOSED"; "OPEN" does not exist |
| Duplicate problems in results | Missing deduplication filter | Add filter not(dt.davis.is_duplicate) immediately after fetch |
| Wrong field name (title, status, severity) | SQL-like naming | Use event.name, event.status, event.category; see field name table above |
| root_cause_entity_id is null | Not all problems have identified root causes | Add filter isNotNull(root_cause_entity_id) when querying root causes |
| Query scans too much data / times out | Missing time range | Always specify from:now() - <duration> on the fetch command |
| affected_entity_ids is empty array | Problem has no mapped affected entities | Check dt.smartscape.service or dt.smartscape_source.id as alternatives |
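
To gauge how often the last few rows apply in a given tenant, a diagnostic along these lines may help. It is a sketch that assumes countIf() with the field names used in this skill; the 7d window is arbitrary.

```dql
// How many problems lack a root cause or mapped affected entities?
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| summarize
    total = count(),
    without_root_cause = countIf(isNull(root_cause_entity_id)),
    without_affected_entities = countIf(isNull(affected_entity_ids) or arraySize(affected_entity_ids) == 0)
```

If either count is a large share of the total, prefer the alternative fields named in the table over filtering those problems out.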

When to Load References

Load problem-trending.md when:

Load problem-correlation.md when:

Load impact-analysis.md when:

References

Skill frontmatter

license: Apache-2.0