reviewing-cluster-health
Performs a comprehensive health check of a CockroachDB cluster. Gathers deployment context first, then provides tier-appropriate diagnostics. Self-Hosted uses SQL against node-level system tables and CLI. Advanced/BYOC use Cloud Console and SQL with node visibility. Standard monitors provisioned compute and workload via Cloud Console. Basic monitors Request Unit consumption and connectivity. Use for daily checks, pre-maintenance validation, post-incident verification, or production readiness assessment.
Reviewing Cluster Health
Performs a comprehensive health check of a CockroachDB cluster. Before running diagnostics, this skill gathers deployment context to provide the right queries and tools for the operator’s tier.
When to Use This Skill
- Daily or shift-start operational health checks
- Before starting maintenance (Self-Hosted, Advanced, BYOC)
- After incidents to confirm recovery
- Verifying production readiness
- Monitoring capacity and performance
For live query issues, use triaging-live-sql-activity. For background jobs, use monitoring-background-jobs. For range analysis, use analyzing-range-distribution.
Step 1: Gather Context
Required Context
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines available diagnostics and operator responsibilities |
| Reason for health check? | Daily check, Pre-maintenance, Post-incident, Pre-upgrade, Production readiness | Prioritizes which dimensions to check first |
Additional Context (by tier)
If Self-Hosted:
| Question | Options | Why It Matters |
|---|---|---|
| Access available? | SQL + CLI, SQL only | Determines which tools can be used |
| Cloud provider? | AWS, GCP, Azure, On-Premises | Affects infrastructure-level checks |
| Kubernetes deployment? | Yes (Operator, Helm, manual), No | Changes CLI commands and monitoring |
| Node count and regions? | e.g., 9 nodes, 3 regions | Sets expectations for query results |
If Advanced or BYOC:
| Question | Options | Why It Matters |
|---|---|---|
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring in your cloud account |
If Standard:
| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for compute utilization assessment |
If Basic: No additional context needed.
Context-Driven Routing
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Health Check |
| Advanced | Advanced Health Check |
| BYOC | BYOC Health Check |
| Standard | Standard Health Check |
| Basic | Basic Health Check |
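The routing table above can be sketched as a small lookup. This is only an illustration: the function name `route_for_tier` and the input normalization are assumptions, while the section names are copied from the table.

```python
# Hypothetical routing helper; section names mirror the routing table.
ROUTES = {
    "self-hosted": "Self-Hosted Health Check",
    "advanced": "Advanced Health Check",
    "byoc": "BYOC Health Check",
    "standard": "Standard Health Check",
    "basic": "Basic Health Check",
}


def route_for_tier(tier: str) -> str:
    """Map a deployment tier answer to the health check section to follow."""
    # Accept loose spellings such as "Self Hosted" or "self_hosted".
    key = tier.strip().lower().replace(" ", "-").replace("_", "-")
    if key not in ROUTES:
        raise ValueError(f"unknown deployment tier: {tier!r}")
    return ROUTES[key]
```

Raising on an unknown tier, rather than defaulting to one, keeps the operator from running diagnostics that do not apply to the deployment.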
Self-Hosted Health Check
Applies when: Tier = Self-Hosted
Self-Hosted node-level health is read primarily through cockroach node status (CLI) and the DB Console. Cluster settings and jobs are read through public SQL (SHOW ALL CLUSTER SETTINGS, SHOW JOBS). The crdb_internal virtual tables for cluster topology, storage, and certificates are not for production use — see the docs for the production-safe table list.
Check 1: Node Liveness, Version, and Replication
cockroach node status --ranges --decommission --certs-dir=<certs-dir> --host=<any-live-node>
Key columns:
- is_live: false requires immediate investigation
- is_draining, is_decommissioning, membership: flag in-progress lifecycle operations
- started_at: compare across runs to spot flapping (node restarts)
- build: version per node; should be a single value (or two during a rolling upgrade)
- ranges_underreplicated (shown only with --ranges or --all): non-zero indicates ranges below the zone's num_replicas
For finer-grained range breakdown, use the DB Console Replication page.
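A minimal sketch of how the key columns above might be checked automatically. It assumes the command was run with `--format=csv` and the range columns enabled, so the output has a header row naming `id`, `is_live`, `build`, and `ranges_underreplicated`, and that booleans print as `true`/`false`. The function name and the returned dictionary layout are hypothetical.

```python
import csv
import io


def _cell(row: dict, column: str) -> str:
    """Return a trimmed cell value, treating a missing column as empty."""
    return (row.get(column) or "").strip()


def summarize_node_status(csv_text: str) -> dict:
    """Summarize CSV output of `cockroach node status` (header row assumed)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Any node whose is_live value is not "true" is reported as dead.
    dead = [_cell(r, "id") for r in rows if _cell(r, "is_live").lower() != "true"]
    builds = sorted({_cell(r, "build") for r in rows if _cell(r, "build")})
    underreplicated = {}
    for r in rows:
        value = _cell(r, "ranges_underreplicated")
        if value.isdigit() and int(value) > 0:
            underreplicated[_cell(r, "id")] = int(value)
    return {
        "node_count": len(rows),
        "dead_nodes": dead,
        "builds": builds,
        "mixed_versions": len(builds) > 1,
        "underreplicated": underreplicated,
    }
```

Feed it the captured CLI output, for example `summarize_node_status(open("node_status.csv").read())`, and alert when `dead_nodes` or `underreplicated` is non-empty. Comparing `started_at` across runs is left out because it needs state from a previous run.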
Check 2: Storage Capacity
No production-safe SQL view exposes per-store capacity. Use:
- DB Console Overview → Storage for per-node usage
- The Prometheus metrics endpoint on each node (metrics capacity, capacity_used, capacity_available):
curl -ks https://<node>:8080/_status/vars | grep '^capacity'
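As a rough illustration, the capacity metrics scraped from the endpoint can be turned into a per-store fullness percentage. The sketch assumes the standard Prometheus text format and that each sample carries a `store` label; samples without one are grouped under `unknown`. The function name `store_usage_percent` is hypothetical.

```python
import re
from collections import defaultdict

# Matches only capacity, capacity_used, and capacity_available samples.
_SAMPLE = re.compile(r"^(capacity(?:_used|_available)?)(?:\{([^}]*)\})?\s+(\S+)$")
_STORE = re.compile(r'store="([^"]*)"')


def store_usage_percent(metrics_text: str) -> dict:
    """Return {store: percent of capacity that is no longer available}."""
    stores = defaultdict(dict)
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments (# HELP, # TYPE) and unrelated metrics
        name, labels, value = match.groups()
        label = _STORE.search(labels or "")
        store = label.group(1) if label else "unknown"
        stores[store][name] = float(value)
    usage = {}
    for store, values in stores.items():
        total = values.get("capacity", 0.0)
        if total > 0 and "capacity_available" in values:
            usage[store] = round(100.0 * (1 - values["capacity_available"] / total), 1)
    return usage
```

The percentage is derived from `capacity_available` rather than `capacity_used`, so it also reflects disk space consumed by anything else on the volume. That is a design choice in this sketch, not something the metrics prescribe.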
Check 3: Certificate Expiration
No SQL view exposes node certificate expiration. Use one of:
- cockroach cert list --certs-dir=<certs-dir> to inspect certs locally on each node
- openssl x509 -in <cert.crt> -noout -enddate for a single cert file
- The Prometheus metrics endpoint (values are UNIX timestamps in seconds; suffixes node, ca, client_ca, ui_ca):
curl -ks https://<node>:8080/_status/vars | grep '^security_certificate_expiration_'
Treat anything within 90 days as EXPIRING_SOON.
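The 90-day rule can be applied to the scraped expiration metrics with a short helper. This is a sketch under stated assumptions: the input is Prometheus text containing `security_certificate_expiration_*` samples, a value of zero is taken to mean that certificate type is not loaded, and the function name and the `EXPIRED` and `OK` labels are hypothetical (only `EXPIRING_SOON` comes from the text).

```python
import re
import time

SOON_WINDOW_SECONDS = 90 * 24 * 3600  # the 90-day threshold from the text

_SAMPLE = re.compile(
    r"^security_certificate_expiration_([a-z_]+)(?:\{[^}]*\})?\s+(\S+)$"
)


def classify_cert_expirations(metrics_text: str, now: float = None) -> dict:
    """Return {cert_type: (status, days_remaining)} from scraped metrics."""
    now = time.time() if now is None else now
    report = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue
        name, raw = match.groups()
        expires_at = float(raw)
        if expires_at <= 0:
            # Assumption: zero means this certificate type is not loaded.
            continue
        remaining = expires_at - now
        if remaining <= 0:
            status = "EXPIRED"
        elif remaining <= SOON_WINDOW_SECONDS:
            status = "EXPIRING_SOON"
        else:
            status = "OK"
        report[name] = (status, remaining / 86400)
    return report
```

Passing `now` explicitly makes the classification reproducible in tests; in normal use it defaults to the current time.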
Check 4: Critical Settings
SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'server.time_until_store_dead', 'admission.kv.enabled',
'cluster.preserve_downgrade_option'
) ORDER BY variable;
gc.ttlseconds is a zone-config parameter, not a cluster setting; check the effective value with SHOW ZONE CONFIGURATION FOR ... against the relevant table/database/range.
Check 5: Consolidated Summary
The DB Console Cluster Overview page consolidates live/dead node count, version distribution, range counts, and storage. From the CLI:
cockroach node status --all --certs-dir=<certs-dir> --host=<any-live-node>
then aggregate the columns of interest in your shell. The cluster’s logical version comes from SQL:
SHOW CLUSTER SETTING version;
If reason = Pre-maintenance, also check for running jobs:
WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;
Check 6: Production Readiness Assessment
Use when verifying a cluster is ready for production workloads or during periodic operational reviews.
# Node count, liveness, and locality diversity
cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>
In the output, count rows with is_live = true (production wants ≥ 3) and check that locality shows multiple regions/zones.
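The counting described above could be scripted roughly as follows. The sketch assumes `--format=csv` output with `is_live` and `locality` columns and localities written as comma-separated `key=value` tiers such as `region=us-east1,az=b`. The function names and the `min_regions=2` default are hypothetical; the three-node minimum comes from the text.

```python
import csv
import io
from collections import Counter


def live_nodes_per_region(csv_text: str, key: str = "region") -> Counter:
    """Count live nodes per value of one locality tier (default: region)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row.get("is_live") or "").strip().lower() != "true":
            continue
        tiers = {}
        for part in (row.get("locality") or "").split(","):
            if "=" in part:
                name, value = part.split("=", 1)
                tiers[name.strip()] = value.strip()
        counts[tiers.get(key, "<unset>")] += 1
    return counts


def meets_baseline(counts, min_live_nodes: int = 3, min_regions: int = 2) -> bool:
    """True when there are enough live nodes spread over enough regions."""
    return sum(counts.values()) >= min_live_nodes and len(counts) >= min_regions
```

For a single-region cluster that spreads nodes across zones, call `live_nodes_per_region(text, key="az")` (or whatever tier name the cluster uses) and treat the result as zone diversity instead.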
-- Critical production settings check
SELECT variable, value,
CASE
WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
ELSE 'OK'
END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'admission.kv.enabled', 'cluster.preserve_downgrade_option',
'server.time_until_store_dead'
) ORDER BY variable;
-- Organization name tied to the Enterprise license (Self-Hosted only)
SHOW CLUSTER SETTING cluster.organization;
See production-readiness reference for the full production readiness checklist.
Advanced Health Check
Applies when: Tier = Advanced
Advanced clusters are dedicated single-tenant clusters managed by Cockroach Labs. You have node-level visibility via both Cloud Console and SQL.
Cloud Console Checks
- Cluster Overview — verify all nodes are live, check node count
- Metrics — CPU utilization, QPS, P99 latency, storage utilization
- Alerts — check for active alerts
CLI + SQL Checks
# Node liveness, version, and replication status
cockroach node status --ranges --decommission --certs-dir=<certs-dir> --host=<any-live-node>
Look at is_live, build, and ranges_underreplicated per node.
-- Running and failed jobs created in the last 24 hours
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;
Cloud API
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
"https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'
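For scripted checks, the same request can be sketched with the Python standard library. The URL, the bearer header, and the `state` and `cockroach_version` fields are taken from the curl and jq command above; the function names are hypothetical, and the sketch does not interpret the state value because the set of possible states is not listed here.

```python
import json
import urllib.request

API_BASE = "https://cockroachlabs.cloud/api/v1"


def build_cluster_request(cluster_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one cluster."""
    return urllib.request.Request(
        f"{API_BASE}/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def extract_state_and_version(body: str) -> tuple:
    """Pull the two fields the health check looks at from the JSON response."""
    data = json.loads(body)
    return data.get("state"), data.get("cockroach_version")
```

To send the request, pass it to `urllib.request.urlopen(...)` and decode the response body before calling `extract_state_and_version`. A 401 surfaces as `urllib.error.HTTPError`, which corresponds to the API key issue listed under Troubleshooting.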
BYOC Health Check
Applies when: Tier = BYOC
BYOC clusters are dedicated and run in your cloud account. You have the same CockroachDB visibility as Advanced, plus direct access to the underlying infrastructure.
CockroachDB Health
Run all Advanced Health Check steps.
Cloud Provider Infrastructure Checks
If AWS:
aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"
If GCP:
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
If Azure:
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"
Additional BYOC Checks
- Verify VPC/network connectivity (PrivateLink, PSC, VPC Peering)
- Check IAM roles — CRL service account permissions still valid
- Review cloud provider monitoring for infrastructure-level anomalies
Standard Health Check
Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes to monitor — Cockroach Labs manages all infrastructure, replication, and capacity. Health checking focuses on your workload performance and provisioned compute.
Cloud Console Checks
- Cluster Overview: verify cluster state is RUNNING
- SQL Activity: statement and transaction latency, error rates
- Storage: current usage
- Compute: provisioned vCPU utilization
SQL Checks
-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';
What to Monitor
- P99 SQL latency — track via Cloud Console Metrics
- Error rates — check for spikes in statement errors
- Storage growth — plan based on usage trends
- Compute utilization — increase provisioned vCPUs if utilization is consistently high
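One way to make "consistently high" concrete is to require that most samples in an observation window exceed a threshold. The sketch below assumes utilization samples are collected by hand or exported from the Cloud Console as fractions between 0 and 1; the 0.8 threshold and 0.9 fraction are illustrative values, not recommendations from Cockroach Labs.

```python
def is_consistently_high(samples, threshold: float = 0.8, min_fraction: float = 0.9) -> bool:
    """True when at least min_fraction of samples are at or above threshold.

    samples: provisioned-vCPU utilization values in the range 0..1.
    """
    samples = list(samples)
    if not samples:
        return False  # no data is not evidence of high utilization
    high = sum(1 for s in samples if s >= threshold)
    return high / len(samples) >= min_fraction
```

Using a fraction of samples instead of the average keeps one short spike, or one idle period, from dominating the decision.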
Note: Node-level visibility is not available on Standard. Use Cloud Console for all infrastructure health monitoring.
Basic Health Check
Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to monitor. Cockroach Labs manages all infrastructure. Health checking focuses on connectivity, consumption, and spending.
Cloud Console Checks
- Cluster Overview: verify state is RUNNING
- Request Units: consumption rate and remaining budget
- Storage: current usage (10 GiB included free)
- Spending Limits: verify limits are configured to avoid unexpected charges
SQL Checks
-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';
What to Monitor
- Request Unit (RU) consumption — track via Cloud Console to stay within spending limits
- Storage usage — monitor growth relative to the 10 GiB free tier
- Query efficiency — optimize queries that consume excessive RUs
- Cold start latency — Basic clusters may scale to zero during inactivity; first connection after idle may have higher latency
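The RU tracking above comes down to simple arithmetic: remaining budget divided by the daily consumption rate. A minimal sketch, assuming both numbers are read from the Cloud Console; the function names and the example figures are hypothetical.

```python
def days_until_budget_exhausted(remaining_ru: float, ru_per_day: float):
    """Days of headroom at the current rate, or None if nothing is being consumed."""
    if ru_per_day <= 0:
        return None
    return remaining_ru / ru_per_day


def exhausts_before_reset(remaining_ru: float, ru_per_day: float, days_to_reset: float) -> bool:
    """True when the budget would run out before the billing period resets."""
    headroom = days_until_budget_exhausted(remaining_ru, ru_per_day)
    return headroom is not None and headroom < days_to_reset
```

For example, 20 million RUs remaining at 2.5 million RUs per day leaves 8 days of headroom, which is a problem if the billing period resets in 10 days and fine if it resets in 5.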
Safety Considerations
All checks in this skill are read-only. No data is modified.
- Self-Hosted: cockroach node status requires CLI access (or admin SQL privilege if you need to fall back to internal tables). Most node-level health queries have no production-safe SQL alternative.
- Advanced/BYOC: cockroach node status works the same way; certificate inspection is managed by Cockroach Labs.
- Standard/Basic: No node-level visibility by design; use the Cloud Console.
Troubleshooting
| Issue | Tier | Fix |
|---|---|---|
| cockroach node status errors with permission denied | SH | Use a cert with admin or VIEWCLUSTERMETADATA |
| Node missing from cockroach node status output | SH | Check node process; verify --join address |
| Standard/Basic SQL doesn’t expose node tables | STD/BAS | Expected — use Cloud Console |
| Cloud Console shows degraded | ADV/BYOC | Check Cloud status page; contact support |
| High RU consumption | BAS | Profile queries; set spending limits |
| Cloud API returns 401 | ADV/BYOC | Regenerate API key |
| High latency on first connection | BAS | Expected cold start after idle period |
References
Skill references:
Related skills:
- upgrading-cluster-version
- managing-cluster-capacity
- performing-cluster-maintenance
- monitoring-background-jobs
Official CockroachDB Documentation: