Agent Skill · Cockroach Labs

triaging-live-sql-activity

Diagnoses live CockroachDB cluster performance issues by identifying long-running queries, busy sessions, and active transactions using SQL-only interfaces. Use when users report cluster slowness, high CPU, or need to find runaway queries and their source applications without DB Console access.

Provider: Cockroach Labs Path in repo: skills/cockroachdb-observability-and-diagnostics/triaging-live-sql-activity/SKILL.md

Skill body

Triaging Live SQL Activity

Diagnoses live cluster performance issues by identifying currently active long-running queries, busy sessions, and active transactions. Uses SQL-only interfaces (SHOW statements and crdb_internal views) to provide immediate triage without requiring DB Console, HTTP endpoints, or Prometheus access.

When to Use This Skill

For historical performance analysis: Use profiling-statement-fingerprints to analyze query patterns over time, identify slow fingerprints, and investigate trends without needing live queries. For transaction-level analysis: Use profiling-transaction-fingerprints to analyze historical transaction retry patterns, commit latency trends, and statement composition. For background job monitoring: Use monitoring-background-jobs to monitor schema changes, backups, and automatic jobs that don’t appear in SHOW CLUSTER STATEMENTS.

Prerequisites

Required SQL access:

Check your privileges:

SHOW SYSTEM GRANTS FOR <username>;

See permissions reference for detailed RBAC setup.

Core Diagnostic Approach

CockroachDB provides SQL-only interfaces for live activity triage:

Interface Purpose Cluster-wide?
SHOW CLUSTER STATEMENTS Currently executing queries Yes (with VIEWACTIVITY)
SHOW CLUSTER SESSIONS Active client sessions Yes (with VIEWACTIVITY)
crdb_internal.cluster_transactions In-progress transactions Yes (with VIEWACTIVITY)

Triage workflow:

  1. Identify long-running queries (> 5-10 minutes)
  2. Correlate to sessions and applications
  3. Check transaction retry counts (high retries = contention)
  4. Drill down by app/user/client
  5. (Optional) Cancel runaway work

Safety: All diagnostic queries are read-only. Cancellation is opt-in with explicit warnings.

Core Diagnostic Queries

Long-Running Queries

Identify queries running longer than a specified threshold:

-- Queries running longer than 5 minutes
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
  query_id,
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  start,
  now() - start AS running_for,
  substring(query, 1, 200) AS query_preview,
  distributed,
  phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

Key columns:

Customizable thresholds:

Active Sessions

Find sessions with long-running active queries:

-- Sessions with active queries running > 5 minutes
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  status,
  active_query_start,
  now() - active_query_start AS active_query_for,
  substring(active_queries, 1, 200) AS active_queries_preview,
  substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
  AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;

Key columns:

Active Transactions

Identify long-running transactions (potential blockers):

-- Transactions running > 5 minutes
SELECT
  id AS txn_id,
  node_id,
  session_id,
  application_name,
  start,
  now() - start AS running_for,
  num_stmts,
  num_retries,
  num_auto_retries,
  substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

Key columns:

Production safety note: crdb_internal.cluster_transactions is production-approved and safe for triage.

Drill-Down by Application, User, or Client

Once you identify suspicious activity, drill down by filtering:

Filter by Application

-- All activity from specific application
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;

Filter by User

-- All activity from specific user
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
       active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
  AND active_query_start IS NOT NULL
ORDER BY active_query_start;

Filter by Client Address

-- All sessions from specific client IP
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
       status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;

Combined Filters

-- Long queries from specific app and user
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
  AND user_name = 'app_user'
  AND start < now() - INTERVAL '10 minutes'
ORDER BY start;

Safety Considerations

Read-only operations: All diagnostic queries (SHOW statements, crdb_internal.cluster_transactions) are read-only and safe to run in production.

Cancellation operations (opt-in):

CAUTION: Canceling queries/sessions terminates user work

Only proceed if:

Canceling Runaway Work (Opt-In)

Cancel a Specific Query

-- 1. Identify the query_id from triage queries above
-- 2. Cancel it
CANCEL QUERY '<query_id>';

Example:

CANCEL QUERY '15f9e0e91f072f0f0000000000000001';

Cancel an Entire Session

-- 1. Identify the session_id from triage queries above
-- 2. Cancel all queries in that session
CANCEL SESSION '<session_id>';

Example:

CANCEL SESSION '15f9e0e91f072f0f';

Verification: After canceling, re-run the triage queries to confirm the query/session is gone.

Required privileges:

See permissions reference for granting these privileges.

Common Triage Workflows

Workflow 1: “Cluster is slow” investigation

Scenario: Users report general slowness.

  1. Check for long-running queries:
    -- Run the "Long-Running Queries" diagnostic
    -- Look for queries running > 5-10 minutes
    
  2. Identify source applications:
    -- Group by application to find culprits
    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT application_name, COUNT(*) AS num_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    ORDER BY num_queries DESC;
    
  3. Drill down into specific app:
    -- Filter by top application from step 2
    -- Use "Filter by Application" query
    
  4. Decide on action:
    • Contact app team to investigate query patterns
    • Cancel specific runaway queries if critical
    • Check for schema/index issues if queries are legitimate

Workflow 2: Find high-retry transactions

Scenario: Suspect contention issues.

  1. Check for high retry counts:
    SELECT application_name, AVG(num_retries) AS avg_retries,
           MAX(num_retries) AS max_retries, COUNT(*) AS num_txns
    FROM crdb_internal.cluster_transactions
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    HAVING AVG(num_retries) > 5
    ORDER BY avg_retries DESC;
    
  2. Investigate specific transactions:
    -- Find transactions with >10 retries
    SELECT id, application_name, num_retries, num_stmts,
           substring(txn_string, 1, 200) AS txn_preview
    FROM crdb_internal.cluster_transactions
    WHERE num_retries > 10
    ORDER BY num_retries DESC;
    
  3. Next steps:
    • Review transaction patterns for contention
    • Check for lock conflicts or hotspots
    • Consider schema changes to reduce contention

Workflow 3: Identify resource hogs by user

Scenario: Need to attribute load to specific users.

  1. Count active queries per user:
    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT user_name, COUNT(*) AS num_active_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    GROUP BY user_name
    ORDER BY num_active_queries DESC;
    
  2. Drill down to specific user’s activity:
    -- Use "Filter by User" query
    
  3. Take action:
    • Contact user if unexpected load
    • Review user’s query patterns
    • Cancel if clearly runaway

Troubleshooting

Issue Cause Fix
SHOW CLUSTER STATEMENTS returns empty No active queries, or insufficient privileges Grant VIEWACTIVITY or VIEWACTIVITYREDACTED; verify cluster has active load
Query text shows <hidden> Using VIEWACTIVITYREDACTED privilege This is expected for privacy; use VIEWACTIVITY if full text needed
Can’t cancel query: “permission denied” Missing CANCELQUERY privilege Grant CANCELQUERY system privilege to your user
crdb_internal.cluster_transactions slow High transaction volume on cluster Add filters (application_name, time threshold) to reduce result set
“relation does not exist” error Typo in table name or old CockroachDB version Verify you’re using production-approved tables; check CockroachDB version compatibility
Triage queries themselves are slow Cluster under extreme load Use more aggressive filters (shorter time window, specific apps); consider canceling obvious runaway work first

Key Considerations

References

Skill references:

Related skills:

Official CockroachDB Documentation:

Skill frontmatter

compatibility: Requires SQL access with VIEWACTIVITY or VIEWACTIVITYREDACTED cluster privilege for cluster-wide visibility. metadata: {"author" => "cockroachdb", "version" => "1.0"}