monitoring-background-jobs
Monitors CockroachDB background job health by identifying failed, paused, and long-running jobs using SHOW JOBS and SHOW AUTOMATIC JOBS. Surfaces schema changes, backups/restores, automatic statistics collection, and SQL stats compaction jobs without DB Console access. Use when investigating schema change delays, failed backups, or automatic job issues.
Skill body
Monitoring Background Jobs
Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.
Prerequisites
- SQL connection with
VIEWJOBsystem privilege (read-only) orCONTROLJOBrole option (control) - Background jobs are excluded from
SHOW CLUSTER STATEMENTSand from statement statistics surfaced in the DB Console SQL Activity page
Related skills: triaging-live-sql-activity for live queries, profiling-statement-fingerprints for historical query analysis.
Key Interfaces
SHOW JOBS: User-initiated + automatic jobs (last 12h default; retention configurable via thejobs.retention_timecluster setting, default 14 days)SHOW AUTOMATIC JOBS: Automatic jobs only (AUTO CREATE STATS, SCHEMA CHANGE GC, etc.)
See job types reference and job states reference for complete catalogs.
Core Diagnostic Queries
Query 1: Failed Jobs (Last 12 Hours)
Identify jobs that failed with error messages:
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
finished,
now() - created AS total_duration,
error
FROM j
WHERE status = 'failed'
AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;
Key columns:
error: Failure reason (check for permission errors, disk space, network issues)description: Human-readable description of what the job was doingtotal_duration: How long the job ran before failing
Common failure patterns:
- Permission denied: User lacks required privileges
- Disk space: Backup destination full
- Network timeout: External storage unreachable
- Constraint violation: Restore conflicts with existing data
Query 2: Long-Running Jobs
Find jobs running longer than expected threshold:
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
status,
running_status,
created,
now() - created AS running_for,
fraction_completed,
coordinator_id
FROM j
WHERE status = 'running'
AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;
Key columns:
running_for: Total elapsed time since job startedfraction_completed: Progress estimate (0.0 to 1.0, NULL if unavailable)running_status: Sub-state details (e.g., “waiting for MVCC GC”)
Customizable thresholds:
- Schema changes: 30 minutes to several hours (depends on table size)
- Backups: 1-6+ hours (depends on data volume)
- Automatic jobs: Usually < 30 minutes
Query 3: Paused Jobs
Identify jobs that are paused and may need attention:
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS paused_for,
coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;
Action required:
Resume with RESUME JOB <job_id> after verifying the pause reason.
Common reasons for paused jobs:
- Manual user pause for maintenance
- Resource constraints (cluster paused the job)
- Error requiring manual intervention
Query 4: Schema Changes Waiting for MVCC GC
Find SCHEMA CHANGE GC jobs waiting for garbage collection:
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS waiting_for,
running_status
FROM j
WHERE status = 'running'
AND job_type = 'SCHEMA CHANGE GC'
AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;
Interpretation:
- Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on the zone-level
gc.ttlseconds(default 4 hours) - Expected duration: Up to
gc.ttlseconds+ some overhead - When to worry: Waiting > 2x
gc.ttlseconds(check the effective value withSHOW ZONE CONFIGURATION FOR ...against the affected table, database, or RANGE — zone configs cascade and may be overridden at any level)
Why this happens: After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents “time-travel” queries from failing.
See job states reference for detailed MVCC GC explanation.
Query 5: Automatic Job Health (24h Window)
Monitor automatic background jobs like statistics collection:
-- Automatic jobs in last 24 hours
SELECT
job_id,
job_type,
description,
status,
created,
finished,
COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;
Key job types:
AUTO CREATE STATS: Automatic table statistics refresh (critical for query optimizer)AUTO SQL STATS COMPACTION: Periodic cleanup of statement/transaction statistics tables
Health indicators:
- Healthy: Regular successful executions (every few hours)
- Unhealthy: No recent executions, or high failure rate
- Impact of failure: Stale statistics lead to poor query plans and slow queries
Query 6: Jobs by Type and Status
Aggregated view for pattern analysis:
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
job_type,
status,
COUNT(*) AS job_count,
MIN(created) AS oldest,
MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;
Use case:
- Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
- Spot anomalies (e.g., unusual job type volume)
- Track job success rates by type
Query 7: Backup and Restore Progress
Track progress of backup/restore jobs:
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS running_for,
ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
CASE
WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
((now() - created) / fraction_completed) - (now() - created)
ELSE NULL
END AS estimated_time_remaining,
running_status
FROM j
WHERE status = 'running'
AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;
Key columns:
percent_complete: Progress percentage (0-100)estimated_time_remaining: Rough estimate based on current progress raterunning_status: Detailed status (e.g., “performing backup to s3://…”)
Note: fraction_completed may be NULL for some job types or early in execution.
Common Workflows
Workflow 1: Schema Change Stuck Investigation
Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.
- Check for running schema changes:
WITH j AS (SHOW JOBS) SELECT job_id, description, created, now() - created AS running_for, fraction_completed, running_status FROM j WHERE status = 'running' AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE') ORDER BY created; - Identify MVCC GC waits:
-- Use Query 4 to find "waiting for MVCC GC" jobs - Interpret results:
- If
running_status= “waiting for MVCC GC”: Normal for post-DROP cleanup (wait up togc.ttlseconds) - If long-running with low
fraction_completed: Check for contention, large table size, or resource constraints - If failed: Check
errorcolumn for specific failure reason
- If
- Next steps:
- MVCC GC wait: Verify the effective
gc.ttlsecondswithSHOW ZONE CONFIGURATION FOR TABLE/DATABASE/RANGE ...against the affected object and wait - Resource constraints: Check cluster CPU/memory usage
- Failed job: Address error (permissions, constraints) and retry operation
- MVCC GC wait: Verify the effective
Workflow 2: Failed Backup Triage
Scenario: Scheduled backup job failed.
- Find recent failed backups:
-- Use Query 1 filtered for BACKUP job type WITH j AS (SHOW JOBS) SELECT job_id, description, created, finished, error FROM j WHERE status = 'failed' AND job_type = 'BACKUP' AND created > now() - INTERVAL '24 hours' ORDER BY created DESC; - Analyze error messages:
- “permission denied”: Check external storage credentials
- “timeout”: Network connectivity to backup destination
- “no space left”: Destination storage full
- “connection refused”: External storage endpoint unreachable
- Verify backup destination:
-- Check SHOW BACKUP for successful backups to same destination SHOW BACKUP 's3://bucket/path'; - Remediate and retry:
- Fix underlying issue (credentials, storage, network)
- Re-run backup command
- Monitor with Query 7 for progress
Workflow 3: Automatic Job Health Check
Scenario: Proactive monitoring of automatic background jobs.
- Check AUTO CREATE STATS frequency:
-- Use Query 5 to see recent automatic statistics jobs SELECT job_type, status, COUNT(*) AS job_count, MAX(created) AS most_recent FROM [SHOW AUTOMATIC JOBS] WHERE created > now() - INTERVAL '24 hours' AND job_type = 'AUTO CREATE STATS' GROUP BY job_type, status; - Expected pattern:
- Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
- Regular AUTO SQL STATS COMPACTION (typically once per hour)
- Warning signs:
- No AUTO CREATE STATS in last 24h: Statistics collection may be disabled
- High failure rate: Check cluster resource constraints or permission issues
- No AUTO SQL STATS COMPACTION: Stats table may grow unbounded
- Verify settings:
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled; -- Should be true SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;
Workflow 4: Long-Running Job Monitoring
Scenario: Track progress of expected long-running operations.
- Identify long-running jobs:
-- Use Query 2 with custom threshold WITH j AS (SHOW JOBS) SELECT job_id, job_type, description, now() - created AS running_for, fraction_completed FROM j WHERE status = 'running' AND created < now() - INTERVAL '30 minutes' ORDER BY created; - Monitor progress over time:
-- Re-run every 10-15 minutes, track fraction_completed changes -- Example: 0.25 → 0.40 → 0.55 indicates steady progress - Estimate completion:
-- Use Query 7 for backup/restore jobs with time estimates - Decide on action:
- Steady progress: Continue monitoring
- Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
- Failed: Use Query 1 to check error
Safety Considerations
Read-only operations (all diagnostic queries):
All SHOW JOBS and SHOW AUTOMATIC JOBS queries are read-only and safe to run in production. No performance impact on cluster operations.
Job control operations (opt-in):
CAUTION: Pausing or canceling jobs can have data integrity implications
Only proceed with job control if:
- You have
CONTROLJOBrole option - You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
- You have authorization to interrupt cluster operations
- You’ve verified the job is truly problematic (not just slow)
Job control commands:
-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;
-- Resume a paused job
RESUME JOB <job_id>;
-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;
Risks by job type:
- SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
- BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
- RESTORE: Canceling may leave database partially restored; requires cleanup
- AUTO CREATE STATS: Canceling is safe (will retry later automatically)
Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.
See permissions reference for CONTROLJOB role option setup.
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
SHOW JOBS returns empty |
No jobs in last 12h, or insufficient privileges | Grant VIEWJOB privilege; verify cluster has recent job activity |
| “waiting for MVCC GC” for many hours | Normal behavior for SCHEMA CHANGE GC after DROP operations | Wait up to gc.ttlseconds (default 4h); check the effective value with SHOW ZONE CONFIGURATION FOR ... against the affected table/database/range |
| Can’t pause/resume job: “permission denied” | Missing CONTROLJOB role option |
Use ALTER ROLE <username> WITH CONTROLJOB (not GRANT SYSTEM) |
Job stuck at same fraction_completed |
Job may be processing large batch, or actually stuck | Wait 15-30 min and re-check; if no change, investigate with live query triage |
| No AUTO CREATE STATS jobs | Automatic collection disabled | Check sql.stats.automatic_collection.enabled = true |
SHOW AUTOMATIC JOBS shows old jobs only |
Need to filter by time window | Add WHERE created > now() - INTERVAL '24 hours' |
| Failed job with “schema change GC” error | Expected for post-DROP cleanup failures | Usually safe to ignore; job will retry automatically |
| Job error: “concurrent schema change” | Multiple schema changes on same table | Wait for first schema change to complete, then retry |
Key Considerations
- MVCC GC waiting: Normal for post-DROP cleanup; duration tied to the zone-level
gc.ttlseconds(default 4h) - Automatic job health: Regular AUTO CREATE STATS is critical for query optimizer performance
- Job control: PAUSE is safer than CANCEL; some cancellations require manual cleanup
- Progress:
fraction_completedmay be NULL for some job types
References
Skill references:
Related skills:
- triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
- profiling-statement-fingerprints - For historical query analysis (background jobs excluded)
Official CockroachDB Documentation: