Agent Skill · Cockroach Labs

managing-cluster-capacity

Manages CockroachDB cluster capacity across all tiers. Self-Hosted covers node decommissioning for permanent removal and adding nodes for expansion. Advanced/BYOC covers scaling node count and machine size via Cloud Console, API, or Terraform. Standard covers adjusting provisioned compute (vCPUs). Basic auto-scales — guidance covers spending limits and cost management. Use when scaling capacity up or down, permanently removing nodes, or managing costs.

Provider: Cockroach Labs
Path in repo: skills/cockroachdb-operations-and-lifecycle/managing-cluster-capacity/SKILL.md

Skill body

Managing Cluster Capacity

Manages cluster capacity across all CockroachDB deployment tiers. What “capacity” means varies by tier — Self-Hosted manages individual nodes, Advanced/BYOC manage node count and machine size, Standard manages provisioned vCPUs, and Basic auto-scales with cost controls.

When to Use This Skill

For temporary maintenance (not capacity changes), use performing-cluster-maintenance.
For a pre-operation health check, use reviewing-cluster-health.


Step 1: Gather Context

Required Context

| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Different capacity model per tier |
| Direction? | Scale up (add capacity), Scale down (reduce capacity) | Determines procedure |

Additional Context (by tier)

If Self-Hosted (scaling down):

| Question | Options | Why It Matters |
|---|---|---|
| How many nodes to remove? | 1, multiple | Multi-node decommission should be done simultaneously |
| Target node IDs? | Node IDs from cockroach node status | Required for CLI commands |
| Is the node alive or dead? | Alive, Dead | Dead nodes use a different procedure |
| Deployment platform? | Bare metal, VMs, Kubernetes | Changes CLI and cleanup steps |
| Current replication factor? | 3, 5, custom | Must have enough nodes remaining |
| Current node count? | Number | Validates remaining capacity |
| Storage utilization? | Low (<60%), Medium (60-80%), High (>80%) | Determines urgency and whether storage maintenance is needed |

If Advanced or BYOC:

| Question | Options | Why It Matters |
|---|---|---|
| Scale method? | Cloud Console, API, Terraform | Determines procedure |
| Current and target configuration? | e.g., 5 nodes → 3 nodes, or 4 vCPU → 8 vCPU | Validates constraints |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | Affects infrastructure verification |

If Standard:

| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for scaling decision |
| Target vCPUs? | Number | Validates workload will fit |

If Basic: Gather cost management goals — Basic auto-scales with no manual capacity control.

Context-Driven Routing

| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Capacity Management |
| Advanced | Advanced Scaling |
| BYOC | BYOC Scaling |
| Standard | Standard Compute Management |
| Basic | Basic Cost Management |

Self-Hosted Capacity Management

Applies when: Tier = Self-Hosted

Scaling Down: Decommission Nodes

Pre-Decommission Validation

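# All nodes live, version-consistent, with replication and per-node range counts
# (--all includes the range columns that --decommission alone does not show)
cockroach node status --all --certs-dir=<certs-dir> --host=<any-live-node>

Inspect the output for is_live = true on every node, the same build on every node, and ranges_underreplicated = 0.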

-- Replication factor (and other zone-level settings)
SHOW ZONE CONFIGURATION FOR RANGE default;

For per-store capacity (so you can verify remaining nodes won’t exceed 60% utilization after absorbing the decommissioned node’s data), use the DB Console OverviewStorage page or scrape the Prometheus metrics endpoint:

curl -ks https://<node>:8080/_status/vars | grep '^capacity'

Node count after decommission must be ≥ the zone’s num_replicas.
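
The two checks above (enough nodes left for num_replicas, and projected utilization under 60%) reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, it assumes data spreads evenly across the surviving nodes, and all inputs are whole numbers in a single unit such as GiB.

```shell
# Hypothetical helpers for pre-decommission arithmetic (not part of cockroach).

# Succeeds when the nodes left after removal can still hold num_replicas copies.
#   $1: current node count   $2: nodes to remove   $3: num_replicas
enough_nodes_remaining() {
  [ $(( $1 - $2 )) -ge "$3" ]
}

# Prints projected utilization as an integer percent (rounded down) once the
# surviving nodes hold all of the data.
#   $1: data stored across ALL nodes today
#   $2: combined capacity of the nodes that will remain
projected_utilization_pct() {
  echo $(( $1 * 100 / $2 ))
}
```

For example, five nodes of 1000 GiB each holding 2750 GiB in total leave 4000 GiB of capacity after one node is removed, so `projected_utilization_pct 2750 4000` prints 68, which is above the 60% target and suggests adding capacity first.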

If Node Is Alive: Drain Then Decommission

# Step 1: Drain
cockroach node drain <node_id> --certs-dir=<certs-dir> --host=<any-live-node>

# Step 2: Decommission (single node)
cockroach node decommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>

# Step 2 (alternative for multiple nodes): decommission all of them in one command, which is more efficient than one at a time
cockroach node decommission <id_1> <id_2> <id_3> --certs-dir=<certs-dir> --host=<any-live-node>

If Node Is Dead: Replace Failed Node

When a node has been dead longer than server.time_until_store_dead (default 5m), CockroachDB automatically re-replicates its data to surviving nodes. Use this procedure to clean up the dead node and optionally add a replacement.

Step 1: Confirm the node is dead and data is safe

# Confirm the dead node and verify replication has caught up
cockroach node status --all --certs-dir=<certs-dir> --host=<any-live-node>

In the output: the dead node should show is_live = false, and every surviving node should show ranges_underreplicated = 0. For per-store capacity on the surviving nodes, use the DB Console OverviewStorage page.

If under-replicated ranges exist, wait for re-replication to complete before proceeding.

Step 2: Decommission the dead node (metadata cleanup)

cockroach node decommission <dead_node_id> --certs-dir=<certs-dir> --host=<any-live-node>

Step 3: Add a replacement node (recommended)

If remaining nodes are above 60% utilization, provision a replacement node using the Scaling Up: Add Nodes procedure.
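
One way to check the 60% threshold on a surviving node is to compute it from the capacity metrics that the _status/vars endpoint exposes. This is a minimal sketch: the function name is hypothetical, and it assumes the endpoint emits `capacity{...} <value>` and `capacity_available{...} <value>` lines in Prometheus text format, with utilization taken as (capacity - available) / capacity.

```shell
# Hypothetical helper: reads Prometheus-format metrics on stdin and prints the
# node's storage utilization as an integer percent (summed over its stores).
# Exits non-zero if no capacity metric was found.
store_utilization_pct() {
  awk '
    index($0, "capacity{") == 1           { cap   += $NF }
    index($0, "capacity_available{") == 1 { avail += $NF }
    END {
      if (cap <= 0) exit 1
      printf "%d\n", (cap - avail) * 100 / cap
    }
  '
}

# Example (placeholders as elsewhere in this skill):
#   curl -ks https://<node>:8080/_status/vars | store_utilization_pct
```

A result above 60 on the surviving nodes is the signal to provision a replacement.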

Multiple dead nodes: Decommission all dead nodes simultaneously:

cockroach node decommission <id_1> <id_2> --certs-dir=<certs-dir> --host=<any-live-node>

See replacing-failed-nodes reference for detailed failure scenarios and recovery procedures.

Monitor Decommission Progress

cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>

Wait for gossiped_replicas = 0 and membership = 'decommissioned'. Then stop the process on the decommissioned node.
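
The wait condition above can be scripted rather than checked by eye. The following is a sketch, not an official tool: the function name is hypothetical, and it assumes tab-separated output (`--format=tsv`) whose header row contains the id, gossiped_replicas, and membership columns.

```shell
# Hypothetical helper: reads `cockroach node status --decommission --format=tsv`
# on stdin and succeeds only when node $1 shows gossiped_replicas = 0 and
# membership = decommissioned. Columns are located by header name.
decommission_done() {
  awk -F'\t' -v id="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $col["id"] == id {
      found = 1
      ok = ($col["gossiped_replicas"] == 0 && $col["membership"] == "decommissioned")
    }
    END { exit !(found && ok) }
  '
}

# Example polling loop (placeholders as elsewhere in this skill):
#   until cockroach node status --decommission --format=tsv \
#           --certs-dir=<certs-dir> --host=<any-live-node> | decommission_done 4; do
#     sleep 30
#   done
```

TSV is used instead of CSV so that commas inside fields such as locality do not shift the columns.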

Cancel a Decommission

cockroach node recommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>

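Recommissioning only works while the node is still in the decommissioning state. Once a node is fully decommissioned, it cannot be recommissioned and must rejoin as a new node.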

Scaling Up: Add Nodes

  1. Provision new hardware/VM with same specs as existing nodes
  2. Install same CockroachDB version (cockroach version to confirm)
  3. Start node with --join pointing to existing cluster nodes
  4. Verify join and monitor rebalancing:
    cockroach node status --ranges --certs-dir=<certs-dir> --host=<any-live-node>
    

    The new node should appear in the output with is_live = true. The ranges column climbs as data rebalances toward the new node.
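
Step 3 above can be sketched as a small wrapper around cockroach start. The function names and argument order are hypothetical, the addresses are whatever your deployment uses, and a real node usually needs further flags (for example --store or --locality) that are omitted here.

```shell
# Hypothetical helper: joins its arguments into a single --join flag.
join_flag() (
  IFS=,
  printf '%s\n' "--join=$*"
)

# Hypothetical wrapper: starts a new node and points it at existing nodes.
#   $1: certs directory   $2: this node's advertise address
#   remaining arguments: addresses of existing cluster nodes
start_new_node() {
  certs_dir=$1
  advertise_addr=$2
  shift 2
  cockroach start \
    --certs-dir="$certs_dir" \
    --advertise-addr="$advertise_addr" \
    "$(join_flag "$@")" \
    --background
}

# Example:
#   start_new_node certs node4.example.internal:26257 \
#     node1.example.internal:26257 node2.example.internal:26257
```

Listing several existing nodes in --join means the new node can still find the cluster if one of them is unreachable at startup.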

Post-Scaling Verification

cockroach node status --all --certs-dir=<certs-dir> --host=<any-live-node>

Expect ranges_underreplicated = 0 on every node and a balanced ranges count across nodes. For per-store capacity utilization, use the DB Console OverviewStorage page.
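
"Balanced" can be made concrete by comparing the fullest and emptiest nodes. The helper below is a sketch with a hypothetical name; it assumes tab-separated output from `cockroach node status --ranges --format=tsv` with ranges and ranges_underreplicated columns in the header row. What spread counts as acceptable is a judgment call for your cluster.

```shell
# Hypothetical helper: reads node status (TSV, with range columns) on stdin and
# prints "<underreplicated_total> <spread_pct>", where spread_pct is the gap
# between the fullest and emptiest node as a percentage of the fullest.
range_balance_summary() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    NF == 0 { next }
    {
      r = $col["ranges"] + 0
      under += $col["ranges_underreplicated"]
      if (n == 0 || r > max) max = r
      if (n == 0 || r < min) min = r
      n++
    }
    END {
      if (n == 0 || max == 0) exit 1
      printf "%d %d\n", under, (max - min) * 100 / max
    }
  '
}

# Example (placeholders as elsewhere in this skill):
#   cockroach node status --ranges --format=tsv \
#     --certs-dir=<certs-dir> --host=<any-live-node> | range_balance_summary
```

An output such as `0 20` would mean no under-replicated ranges and a 20% gap between the fullest and emptiest node.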


Advanced Scaling

Applies when: Tier = Advanced

Advanced clusters are managed by Cockroach Labs. Capacity is adjusted by changing node count or machine size.

Via Cloud Console

  1. Cluster → Capacity
  2. Adjust node count or machine type (vCPUs per node)
  3. Cockroach Labs handles all node operations (drain, decommission, provisioning) safely
  4. Monitor progress in Cloud Console

Via Cloud API

# Scale node count
curl -X PATCH -H "Authorization: Bearer $COCKROACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"config": {"num_nodes": <new_count>}}' \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>"

Via Terraform

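resource "cockroach_cluster" "example" {
  # Other required arguments (such as name and cloud_provider) omitted
  dedicated = {
    num_virtual_cpus = 8     # vCPUs per node
    storage_gib      = 150
  }
  regions = [
    { name = "<region>", node_count = 5 }   # nodes in this region
  ]
}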

Pre-Scaling Check

-- Ensure no disruptive jobs are running before scaling down
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j WHERE status = 'running' GROUP BY 1, 2;

Constraints


BYOC Scaling

Applies when: Tier = BYOC

Follow all Advanced Scaling steps. BYOC scaling is managed through the same Cloud Console/API/Terraform interfaces.

Cloud Provider Verification (after scaling down)

If AWS:

aws ec2 describe-instances --filters "Name=tag:cockroach-cluster,Values=<cluster-name>" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'

If GCP:

gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"

If Azure:

az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<name>']"

Additional BYOC Considerations


Standard Compute Management

Applies when: Tier = Standard

Standard is a multi-tenant managed service. There are no individual nodes. Capacity is managed by adjusting provisioned compute (vCPUs).

Adjust Provisioned vCPUs

  1. Cloud Console → Cluster → Capacity
  2. Increase or decrease provisioned vCPUs
  3. Change takes effect without downtime

Before Scaling Down

After Scaling

Monitor P99 latency and QPS in Cloud Console for 24-48 hours. If latency increases after scaling down, scale compute back up.


Basic Cost Management

Applies when: Tier = Basic

Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to manage. Capacity scales automatically based on demand. Cost is managed through spending controls.

Manage Spending

When to Consider Upgrading

If you need explicit control over compute capacity (guaranteed vCPUs), consider upgrading to Standard. If you need dedicated infrastructure, consider Advanced.


Safety Considerations

| Operation | Tier | Reversible? |
|---|---|---|
| cockroach node decommission | SH | Recommission only before completion |
| Stop decommissioned node | SH | No (must rejoin as new node) |
| Add node to cluster | SH | Yes (decommission to remove) |
| Scale via Console/API | ADV/BYOC | Contact support to reverse |
| Adjust provisioned vCPUs | STD | Yes (scale back) |
| Set spending limit | BAS | Yes (adjust anytime) |

Critical (Self-Hosted):

Troubleshooting

| Issue | Tier | Fix |
|---|---|---|
| Decommission hangs | SH | Check zone config constraints; investigate stalled ranges |
| Recommission fails | SH | Node already fully decommissioned; must rejoin as new |
| New node not rebalancing | SH | Wait for automatic rebalancing; check the ranges column of cockroach node status --ranges |
| Scale-down rejected | ADV/BYOC | Below minimum or data won’t fit |
| Latency spike after reduction | STD | Scale provisioned vCPUs back up |
| Cloud instances not cleaned up | BYOC | Contact support; verify in cloud console |
| Dead node not re-replicating | SH | Check server.time_until_store_dead; verify surviving nodes have capacity |
| Storage utilization high after scale-down | SH | Add replacement node or increase disk size |

References

Skill references:

Related skills:

Official CockroachDB Documentation:

Skill frontmatter

compatibility: Self-Hosted requires CLI access and admin role. Advanced/BYOC requires Cloud Console with Cluster Admin role. Standard requires Cloud Console. Basic has no capacity management, only spending limits.
metadata: {"author" => "cockroachdb", "version" => "2.0"}