Agent Skill · dynatrace

dt-obs-kubernetes

Kubernetes cluster, pod, node, and workload monitoring. Use when analyzing K8s health, resource optimization, pod failures, OOMKills, scheduling, or security posture. Also use for Kubernetes operational events like pod restarts, OOM events, evictions, and cluster event history. Trigger: "Kubernetes pods", "K8s cluster health", "OOMKill", "pod restarts", "container CPU", "namespace resource usage", "over-provisioned pods", "privileged containers", "pod placement", "K8s node capacity", "running containers by cluster", "workload scheduling", "pod evictions", "K8s labels and annotations", "kubernetes events", "pod restart events", "OOM events", "K8s event history". Do NOT use for explaining existing queries, product documentation questions, AWS-specific resource queries, service-level RED metrics, distributed tracing, or log analysis — use the relevant skill instead.

Provider: dynatrace Path in repo: skills/dt-obs-kubernetes/SKILL.md

Skill body

Infrastructure Kubernetes

Monitor and analyze Kubernetes infrastructure using Dynatrace DQL. Query cluster resources, monitor workload health, analyze pod placement, optimize costs, and assess security posture.

When to Use This Skill

Core Monitoring (Start Here)

  1. Cluster Inventoryreferences/cluster-inventory.md - Clusters, namespaces, resource distribution
  2. Node Monitoring - Node capacity, CPU/memory usage, pod density
  3. Pod Monitoring - Pod CPU, memory, lifecycle events
  4. Workload Monitoring - Deployment, StatefulSet, DaemonSet resources

Advanced Topics

  1. Configuration Analysisreferences/labels-annotations.md - Parse k8s.object, labels, annotations
  2. Scheduling & Placementreferences/pod-node-placement.md - Node selectors, affinity, taints, HA
  3. Cost Optimization - Right-sizing, waste detection, efficiency scoring
  4. Security & Compliance - Privileged containers, security contexts

Key Concepts

Entity Types

Workloads: K8S_DEPLOYMENT, K8S_STATEFULSET, K8S_DAEMONSET, K8S_JOB, K8S_CRONJOB, K8S_HORIZONTALPODAUTOSCALER
Infrastructure: K8S_CLUSTER, K8S_NAMESPACE, K8S_NODE, K8S_POD
Configuration: K8S_SERVICE, K8S_CONFIGMAP, K8S_SECRET, K8S_PERSISTENTVOLUMECLAIM, K8S_PERSISTENTVOLUME, K8S_INGRESS, K8S_NETWORKPOLICY

Query Types

smartscapeNodes - Query K8s entities:

smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production"
| fields k8s.cluster.name, k8s.pod.name

timeseries - Monitor metrics over time:

timeseries cpu = sum(dt.kubernetes.container.cpu_usage),
  by: {k8s.pod.name, k8s.namespace.name}
| fieldsAdd avg_cpu = arrayAvg(cpu)

fetch logs - Analyze log events:

fetch logs
| filter k8s.namespace.name == "production" and loglevel == "ERROR"

Core Fields

Available Metrics

CPU: dt.kubernetes.container.cpu_usage, cpu_throttled, limits_cpu, requests_cpu
Memory: dt.kubernetes.container.memory_working_set, limits_memory, requests_memory
Operations: dt.kubernetes.container.restarts, oom_kills
Node: dt.kubernetes.node.pods_allocatable, cpu_allocatable, memory_allocatable, dt.kubernetes.pods

Entity Disambiguation

K8S_POD vs CONTAINER: these are different entity types in Dynatrace.

The smartscape edge is CONTAINER --(is_part_of)--> K8S_POD. To reach containers from a pod, traverse backward:

smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| traverse edgeTypes: {is_part_of}, targetTypes: {CONTAINER}, direction: backward, fieldsKeep: {id}
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container.id=id

Service → K8S_POD Correlation

No direct smartscape edge exists between SERVICE and K8S_POD. The correlation key is the shared dimension k8s.workload.name. See Service → Pod Drill-Down in references/pod-debugging.md for the full two-step pattern.

Common Workflows

1. Cluster Health Check

List all clusters:

smartscapeNodes K8S_CLUSTER
| fields k8s.cluster.name, k8s.cluster.version, k8s.cluster.distribution

Check node capacity:

timeseries {
  current_pods = avg(dt.kubernetes.pods),
  max_pods = avg(dt.kubernetes.node.pods_allocatable)
}, by: {k8s.node.name, k8s.cluster.name}
| fieldsAdd pod_capacity_pct = (arrayAvg(current_pods) / arrayAvg(max_pods)) * 100
| filter pod_capacity_pct > 80

Identify pods in non-Running state:

smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| filter phase != "Running"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, phase

2. Resource Optimization

Find over-provisioned pods (usage < 30%):

timeseries {
  cpu_usage = sum(dt.kubernetes.container.cpu_usage),
  cpu_requests = avg(dt.kubernetes.container.requests_cpu)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd usage_pct = (arrayAvg(cpu_usage) / arrayAvg(cpu_requests)) * 100
| filter usage_pct < 30 and arrayAvg(cpu_requests) > 0

Identify containers without limits:

smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    cpu_limit = container[resources][limits][cpu],
    memory_limit = container[resources][limits][memory]
| filter isNull(cpu_limit) or isNull(memory_limit)

3. Troubleshooting Pod Issues

Pod troubleshooting benefits from combining metrics (timeseries) with Kubernetes events (event stream) for a complete picture.

Metrics-Based Troubleshooting

Find pods with OOMKills:

timeseries oom_kills = sum(dt.kubernetes.container.oom_kills),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| filter arraySum(oom_kills) > 0
| fieldsAdd total_oom_kills = arraySum(oom_kills)
| sort total_oom_kills desc

Analyze pod restart patterns:

timeseries restarts = sum(dt.kubernetes.container.restarts),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts > 5

Event-Based Troubleshooting

For operational events (pod restarts, OOM kills, evictions, scheduling failures), Kubernetes events provide richer context than metrics alone — including event reasons, messages, and timestamps.

When to use Kubernetes events over metrics:

Kubernetes events are available through the get-events-for-kubernetes-cluster tool. Prefer this tool when the user asks about OOM events, pod restarts, evictions, or cluster-wide event history.

Important: distinguish event types when filtering results. Kubernetes events cover many categories. When the user asks about a specific event type, filter the results accordingly — do not report unrelated events:

User Asks About Relevant Event Reasons NOT Related
Pod restarts BackOff, CrashLoopBackOff, Killing Readiness probe failures, CPU throttling
OOM events OOMKilling, OOMKilled Memory pressure warnings
Evictions Evicted, Preempting Node pressure
Scheduling failures FailedScheduling, Unschedulable Resource quotas

For a complete answer, combine both approaches:

  1. Use the events tool to get the event details (what happened, when, why)
  2. Use timeseries metrics to show the quantitative impact (how many restarts, OOM kill counts over time)

Fetch Kubernetes Events via DQL

Pod restart and operational events can also be queried via DQL from the events table:

fetch events
| filter event.kind == "K8S_EVENT"
| filter event.type == "Warning"
| fields timestamp, k8s.cluster.name, k8s.namespace.name, k8s.pod.name,
    event.reason, event.message
| sort timestamp desc
| limit 50

Filter for specific event reasons:

fetch events
| filter event.kind == "K8S_EVENT"
| filter in(event.reason, {"OOMKilling", "BackOff", "Evicted", "FailedScheduling"})
| fields timestamp, k8s.cluster.name, k8s.namespace.name, k8s.pod.name,
    event.reason, event.message
| sort timestamp desc

Field names in fetch events: Use event.reason and event.message — not dt.kubernetes.event.reason. The dt.kubernetes.* prefix is for timeseries metrics, not the events table. Queries using the wrong prefix return zero results.

4. Security Assessment

Identify privileged containers:

smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    privileged = container[securityContext][privileged]
| filter privileged == true

Find containers running as root:

smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    run_as_user = container[securityContext][runAsUser],
    run_as_non_root = container[securityContext][runAsNonRoot]
| filter (isNull(run_as_user) or run_as_user == 0) and run_as_non_root != true

5. Scheduling Analysis

Verify pod distribution (HA compliance):

smartscapeNodes K8S_POD
| filter k8s.workload.kind == "deployment"
| summarize pod_count = count(),
            node_count = countDistinct(k8s.node.name),
            by: {k8s.cluster.name, k8s.namespace.name, k8s.workload.name}
| fieldsAdd ha_compliant = node_count > 1
| filter pod_count >= 2 and not ha_compliant

6. DAVIS Problems affecting K8s Entities

Find active DAVIS problems affecting K8s entities:

fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| fields display_id, event.name, event.category, smartscape.affected_entity.ids

Use entries smartscape.affected_entity.ids (array of Smartscape IDs) to look up the affected entity using its Smartscape ID.

Best Practices

Choosing the Right Data Source

User Question Best Approach Why
“Show me OOM events” Events tool + metrics Events give reasons/messages; metrics show trends
“Show me pod restart events” Events tool + timeseries metrics Events reveal the reason (BackOff, Killing, CrashLoopBackOff); dt.kubernetes.container.restarts metric gives the actual restart counts
“How many pod restarts?” Timeseries metrics Quantitative data over time
“What happened to my pods in the last 48h?” Events tool Operational event history with context
“Which pods are using the most CPU?” Timeseries metrics Resource utilization analysis
“List all clusters/namespaces” smartscapeNodes Entity discovery and inventory
“Are there scheduling failures?” Events tool Event reasons explain why

Query Performance

  1. Filter early - Apply cluster/namespace filters immediately
  2. Use specific entity types - Avoid wildcards
  3. Limit result sets - Use limit for exploration
  4. Cache cluster lists - Store in variables

Monitoring Recommendations

  1. Set resource limits on all containers
  2. Monitor OOMKills and adjust memory limits
  3. Track CPU throttling and adjust CPU limits
  4. Review resource efficiency regularly (target 70-80%)
  5. Implement security best practices (non-root, read-only filesystem)
  6. Use specific image tags (avoid :latest)

Configuration Standards

  1. Use labels for organization (app, environment, team)
  2. Set resource requests and limits
  3. Configure health checks (liveness/readiness probes)
  4. Use TLS for all ingress resources
  5. Document with annotations

Troubleshooting

Problem Cause Solution
No pod data returned Wrong entity type or missing cluster filter Use K8S_POD (not POD); add k8s.cluster.name filter
k8s.object parsing errors Complex JSON structure Use parse k8s.object, "JSON:config" then access nested fields
Pod network metrics unavailable Not available in Grail Use service mesh metrics or host-level network metrics
Large result sets No time range or cluster filter Add time range and filter by cluster/namespace early
Missing labels in output Labels accessed incorrectly Use tags[label_name] to access labels

Limitations

Unavailable Metrics:

Query Considerations:

When to Load References

Load cluster-inventory.md when:

references/cluster-inventory.md

Load labels-annotations.md when:

references/labels-annotations.md

Load pod-node-placement.md when:

references/pod-node-placement.md

Load pod-debugging.md when:

references/pod-debugging.md

Load workload-health.md when:

references/workload-health.md

Load pv-pvc.md when:

references/pv-pvc.md

Load ingress.md when:

references/ingress.md

Load network-policies.md when:

references/network-policies.md

References

Skill frontmatter

license: Apache-2.0