Agent Skill · Cockroach Labs

reviewing-cluster-health

Performs a comprehensive health check of a CockroachDB cluster. Gathers deployment context first, then provides tier-appropriate diagnostics. Self-Hosted uses SQL against node-level system tables and CLI. Advanced/BYOC use Cloud Console and SQL with node visibility. Standard monitors provisioned compute and workload via Cloud Console. Basic monitors Request Unit consumption and connectivity. Use for daily checks, pre-maintenance validation, post-incident verification, or production readiness assessment.

Provider: Cockroach Labs
Path in repo: skills/cockroachdb-operations-and-lifecycle/reviewing-cluster-health/SKILL.md

Skill body

Reviewing Cluster Health

Performs a comprehensive health check of a CockroachDB cluster. Before running diagnostics, this skill gathers deployment context to provide the right queries and tools for the operator’s tier.

When to Use This Skill

For live query issues: Use triaging-live-sql-activity. For background jobs: Use monitoring-background-jobs. For range analysis: Use analyzing-range-distribution.


Step 1: Gather Context

Required Context

| Question | Options | Why It Matters |
| --- | --- | --- |
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines available diagnostics and operator responsibilities |
| Reason for health check? | Daily check, Pre-maintenance, Post-incident, Pre-upgrade | Prioritizes which dimensions to check first |

Additional Context (by tier)

If Self-Hosted:

| Question | Options | Why It Matters |
| --- | --- | --- |
| Access available? | SQL + CLI, SQL only | Determines which tools can be used |
| Cloud provider? | AWS, GCP, Azure, On-Premises | Affects infrastructure-level checks |
| Kubernetes deployment? | Yes (Operator, Helm, manual), No | Changes CLI commands and monitoring |
| Node count and regions? | e.g., 9 nodes, 3 regions | Sets expectations for query results |

If Advanced or BYOC:

| Question | Options | Why It Matters |
| --- | --- | --- |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring in your cloud account |

If Standard:

| Question | Options | Why It Matters |
| --- | --- | --- |
| Current provisioned vCPUs? | Number | Context for compute utilization assessment |

If Basic: No additional context needed.

Context-Driven Routing

| Tier | Go To |
| --- | --- |
| Self-Hosted | Self-Hosted Health Check |
| Advanced | Advanced Health Check |
| BYOC | BYOC Health Check |
| Standard | Standard Health Check |
| Basic | Basic Health Check |

Self-Hosted Health Check

Applies when: Tier = Self-Hosted

Self-Hosted node-level health is read primarily through cockroach node status (CLI) and the DB Console. Cluster settings and jobs are read through public SQL (SHOW ALL CLUSTER SETTINGS, SHOW JOBS). The crdb_internal virtual tables for cluster topology, storage, and certificates are not for production use — see the docs for the production-safe table list.

Check 1: Node Liveness, Version, and Replication

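cockroach node status --all --certs-dir=<certs-dir> --host=<any-live-node>

Key columns: is_live (liveness), build (binary version), and ranges_underreplicated (replication). The range columns appear only with --ranges or --all; --decommission alone adds the decommissioning columns.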

For finer-grained range breakdown, use the DB Console Replication page.
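
As a rough sketch of how the node status output could be screened automatically, the helper below reads the command's tab-separated output and reports nodes that are not live or that have under-replicated ranges. The function name flag_unhealthy_nodes is hypothetical, and the sketch assumes the TSV output starts with a header row containing the id, is_live, and ranges_underreplicated columns.

```shell
#!/usr/bin/env bash
# flag_unhealthy_nodes: hypothetical helper (not part of CockroachDB tooling).
# Reads `cockroach node status` TSV output on stdin and prints one line per
# node that is not live or that reports under-replicated ranges.
# Columns are located by header name, so column order does not matter.
flag_unhealthy_nodes() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # header name -> field index
      if (!("id" in col) || !("is_live" in col)) {
        print "missing id/is_live columns" > "/dev/stderr"
        exit 2
      }
      next
    }
    {
      # ranges_underreplicated is only present with --ranges or --all.
      under = ("ranges_underreplicated" in col) ? $(col["ranges_underreplicated"]) + 0 : 0
      if ($(col["is_live"]) != "true")
        printf "node %s: NOT LIVE\n", $(col["id"])
      else if (under > 0)
        printf "node %s: %d under-replicated ranges\n", $(col["id"]), under
    }
  '
}
```

Possible usage: cockroach node status --all --format=tsv --certs-dir=<certs-dir> --host=<any-live-node> | flag_unhealthy_nodes. No output means every listed node is live and fully replicated.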

Check 2: Storage Capacity

No production-safe SQL view exposes per-store capacity. Use:

Check 3: Certificate Expiration

No SQL view exposes node certificate expiration. Use one of:

Treat any certificate that expires within 90 days as EXPIRING_SOON.
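
One way to apply the 90-day threshold from the shell is sketched below with openssl x509 -checkend. The helper name cert_status and the EXPIRED/OK labels are assumptions made for illustration; only EXPIRING_SOON comes from this skill. The sketch also assumes openssl is installed on the host and that the certificate is a PEM file such as <certs-dir>/node.crt.

```shell
#!/usr/bin/env bash
# cert_status: hypothetical helper (not part of CockroachDB tooling).
# Classifies a PEM certificate as EXPIRED, EXPIRING_SOON, or OK.
# Usage: cert_status <cert-file> [warning-window-in-days, default 90]
cert_status() {
  local cert="$1" days="${2:-90}"
  if [ ! -r "$cert" ]; then
    echo "cannot read certificate: $cert" >&2
    return 2
  fi
  # Reject files that openssl cannot parse as an X.509 certificate.
  if ! openssl x509 -noout -in "$cert" >/dev/null 2>&1; then
    echo "not a valid certificate: $cert" >&2
    return 2
  fi
  # -checkend N exits 0 when the certificate is still valid N seconds from now.
  if ! openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
    echo "EXPIRED"
  elif ! openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null 2>&1; then
    echo "EXPIRING_SOON"
  else
    echo "OK"
  fi
}
```

Possible usage: cert_status <certs-dir>/node.crt prints one of the three labels; pass a second argument to use a window other than 90 days.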

Check 4: Critical Settings

SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
  'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
  'server.time_until_store_dead', 'admission.kv.enabled',
  'cluster.preserve_downgrade_option'
) ORDER BY variable;

gc.ttlseconds is a zone-config parameter, not a cluster setting; check the effective value with SHOW ZONE CONFIGURATION FOR ... against the relevant table/database/range.

Check 5: Consolidated Summary

The DB Console Cluster Overview page consolidates live/dead node count, version distribution, range counts, and storage. From the CLI:

cockroach node status --all --certs-dir=<certs-dir> --host=<any-live-node>

then aggregate the columns of interest in your shell. The cluster’s logical version comes from SQL:

SHOW CLUSTER SETTING version;

If reason = Pre-maintenance, also check for running jobs:

WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;
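
The running-jobs query can be turned into a simple pre-maintenance gate, sketched below. The helper name running_jobs_gate is hypothetical, and the sketch assumes the query result is piped in as tab-separated text with a header row followed by one job_type and count pair per line.

```shell
#!/usr/bin/env bash
# running_jobs_gate: hypothetical helper (not part of CockroachDB tooling).
# Reads the TSV result of the running-jobs query on stdin
# (header row, then job_type<TAB>count). Prints each running job type
# and returns 1 if any are present, 0 otherwise.
running_jobs_gate() {
  awk -F'\t' '
    NR == 1 { next }                                    # skip the header row
    NF >= 2 { printf "running: %s x%d\n", $1, $2; found = 1 }
    END     { exit (found ? 1 : 0) }
  '
}
```

Possible usage: run the query with cockroach sql --format=tsv -e "<query>" and pipe the result into running_jobs_gate; a non-zero exit status signals that maintenance should wait or that the listed jobs need review first.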

Check 6: Production Readiness Assessment

Use when verifying a cluster is ready for production workloads or during periodic operational reviews.

# Node count, liveness, and locality diversity
cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>

In the output, count rows with is_live = true (production wants ≥ 3) and check that locality shows multiple regions/zones.
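
The live-node and locality counts described above could be computed as in the sketch below. The helper name readiness_summary is hypothetical. It assumes TSV output with a header row that includes is_live and locality, and it treats each distinct locality string as one failure domain, which is a simplification of "multiple regions/zones".

```shell
#!/usr/bin/env bash
# readiness_summary: hypothetical helper (not part of CockroachDB tooling).
# Reads `cockroach node status` TSV output on stdin, counts live nodes and
# distinct locality values among them, and returns 1 when there are fewer
# than 3 live nodes or only one distinct locality.
readiness_summary() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # header name -> field index
      if (!("is_live" in col) || !("locality" in col)) {
        print "missing is_live/locality columns" > "/dev/stderr"
        exit 2
      }
      next
    }
    $(col["is_live"]) == "true" {
      live++
      loc = $(col["locality"])
      if (!(loc in seen)) { seen[loc] = 1; localities++ }
    }
    END {
      if (!("is_live" in col)) exit 2        # header check already failed
      printf "live_nodes=%d distinct_localities=%d\n", live, localities
      exit ((live >= 3 && localities > 1) ? 0 : 1)
    }
  '
}
```

Possible usage: cockroach node status --format=tsv --certs-dir=<certs-dir> --host=<any-live-node> | readiness_summary. The thresholds mirror the guidance above and can be adjusted for larger clusters.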

-- Critical production settings check
SELECT variable, value,
  CASE
    WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
    WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
    WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
    WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
    ELSE 'OK'
  END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
  'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
  'admission.kv.enabled', 'cluster.preserve_downgrade_option',
  'server.time_until_store_dead'
) ORDER BY variable;

-- Organization name registered for the enterprise license (Self-Hosted only)
SHOW CLUSTER SETTING cluster.organization;

See production-readiness reference for the full production readiness checklist.


Advanced Health Check

Applies when: Tier = Advanced

Advanced clusters are dedicated single-tenant clusters managed by Cockroach Labs. You have node-level visibility via both Cloud Console and SQL.

Cloud Console Checks

  1. Cluster Overview — verify all nodes are live, check node count
  2. Metrics — CPU utilization, QPS, P99 latency, storage utilization
  3. Alerts — check for active alerts

CLI + SQL Checks

# Node liveness, version, and replication status
cockroach node status --all --certs-dir=<certs-dir> --host=<any-live-node>

Look at is_live, build, and ranges_underreplicated per node.

-- Running and failed jobs created in the last 24 hours
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;

Cloud API

curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'

BYOC Health Check

Applies when: Tier = BYOC

BYOC clusters are dedicated and run in your cloud account. You have the same CockroachDB visibility as Advanced, plus direct access to the underlying infrastructure.

CockroachDB Health

Run all Advanced Health Check steps.

Cloud Provider Infrastructure Checks

If AWS:

aws ec2 describe-instances --filters "Name=tag:cockroach-cluster,Values=<cluster-name>" --query "Reservations[].Instances[].{id:InstanceId,state:State.Name}"

If GCP:

gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"

If Azure:

az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"

Additional BYOC Checks


Standard Health Check

Applies when: Tier = Standard

Standard is a multi-tenant managed service. There are no individual nodes to monitor — Cockroach Labs manages all infrastructure, replication, and capacity. Health checking focuses on your workload performance and provisioned compute.

Cloud Console Checks

  1. Cluster Overview — verify cluster state is RUNNING
  2. SQL Activity — statement and transaction latency, error rates
  3. Storage — current usage
  4. Compute — provisioned vCPU utilization

SQL Checks

-- Verify connectivity
SELECT 1;

-- Current version
SELECT version();

-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';

What to Monitor

Note: Node-level visibility is not available on Standard. Use Cloud Console for all infrastructure health monitoring.


Basic Health Check

Applies when: Tier = Basic

Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to monitor. Cockroach Labs manages all infrastructure. Health checking focuses on connectivity, consumption, and spending.

Cloud Console Checks

  1. Cluster Overview — verify state is RUNNING
  2. Request Units — consumption rate and remaining budget
  3. Storage — current usage (10 GiB included free)
  4. Spending Limits — verify limits are configured to avoid unexpected charges

SQL Checks

-- Verify connectivity
SELECT 1;

-- Current version
SELECT version();

-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';
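
Because the first connection to a Basic cluster after an idle period can be slow, a connectivity probe may need a few attempts before it is treated as a failure. The sketch below shows one way to do that. The helper name retry_with_backoff and the DATABASE_URL variable in the usage note are assumptions, not part of any CockroachDB tooling.

```shell
#!/usr/bin/env bash
# retry_with_backoff: hypothetical helper (not part of CockroachDB tooling).
# Runs a command up to <attempts> times, doubling the sleep between tries,
# and returns the exit status of the last attempt.
# Usage: retry_with_backoff <attempts> <initial-delay-seconds> <command...>
retry_with_backoff() {
  local attempts="$1" delay="$2" n=1 status=0
  shift 2
  while true; do
    "$@" && return 0
    status=$?
    if [ "$n" -ge "$attempts" ]; then
      return "$status"
    fi
    sleep "$delay"
    delay=$((delay * 2))   # integer seconds only
    n=$((n + 1))
  done
}
```

Possible usage: retry_with_backoff 5 1 cockroach sql --url "$DATABASE_URL" -e "SELECT 1". If the probe still fails after the final attempt, treat it as a real connectivity problem and not a cold start.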

What to Monitor


Safety Considerations

All checks in this skill are read-only. No data is modified.

Troubleshooting

| Issue | Tier | Fix |
| --- | --- | --- |
| cockroach node status errors with permission denied | Self-Hosted | Use a client certificate for a user with the admin role or the VIEWCLUSTERMETADATA privilege |
| Node missing from cockroach node status output | Self-Hosted | Check that the node process is running; verify its --join address |
| Standard/Basic SQL doesn't expose node tables | Standard, Basic | Expected; use Cloud Console |
| Cloud Console shows degraded | Advanced, BYOC | Check the Cloud status page; contact support |
| High RU consumption | Basic | Profile queries; set spending limits |
| Cloud API returns 401 | Advanced, BYOC | Regenerate the API key |
| High latency on first connection | Basic | Expected cold start after an idle period |

References

Skill references:

Related skills:

Official CockroachDB Documentation:

Skill frontmatter

compatibility: Self-Hosted requires SQL access with admin or VIEWCLUSTERMETADATA privilege. Advanced/BYOC require Cloud Console and SQL connectivity. Standard requires Cloud Console and SQL. Basic requires Cloud Console.
metadata: {"author": "cockroachdb", "version": "2.0"}