Cerebrium Rate Limits

Name: Cerebrium Rate Limits
Creator: Cerebrium
Keywords: AI, GPU, Serverless, Inference, ML Infrastructure, Rate Limiting, Quotas, Throttling

Cerebrium does not throttle deployed function endpoints with classic per-minute request quotas; instead, throughput is governed by autoscaling GPU/CPU concurrency limits per app and per account plan tier. The number of concurrent replicas an app can scale to is configured per deployment (min/max instances) and capped by the plan tier, with Enterprise offering unlimited GPU concurrency. Async runs are bounded by a maximum execution window (up to 12 hours) and a configurable response grace period. Specific per-account concurrency caps are not reconciled in this artifact.

Cerebrium Rate Limits is the machine-readable rate-limit profile for Cerebrium on the APIs.io network, conforming to the API Commons Rate Limits specification.

It captures 4 rate-limit definitions, measured in instances, gpu_instances, and seconds.

The profile also defines 3 backoff/retry policies and documents the response code returned for throttled requests.

Tagged areas include AI, GPU, Serverless, Inference, and ML Infrastructure.

Limits: 4
Throttle response code: 429

Limits

Concurrent Replicas (per app)
Scope: app
Unit: instances
Limit: see provider documentation
Notes: Configured via min/max instances in cerebrium.toml; capped by plan tier.

GPU Concurrency (per account)
Scope: account
Unit: gpu_instances
Limit: see provider documentation
Notes: Bounded by plan tier; Enterprise offers unlimited GPU concurrency.

Async Execution Window
Scope: request
Unit: seconds
Limit: up to 12 hours
Notes: Async runs are bounded by response_grace_period (default 15 minutes, up to 12 hours).
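
Because this limit is measured in seconds while its bounds are quoted in minutes and hours, a small conversion helps. The sketch below assumes only the 15-minute default and 12-hour ceiling stated above. The function name `clamp_grace_period` and its handling of out-of-range values are illustrative assumptions, not Cerebrium's own validation logic.

```python
# Bounds taken from the limit description above, expressed in seconds.
DEFAULT_GRACE_SECONDS = 15 * 60       # 15 minutes -> 900 s
MAX_GRACE_SECONDS = 12 * 60 * 60      # 12 hours   -> 43200 s


def clamp_grace_period(requested_seconds=None):
    """Return a grace period in seconds that stays within the stated bounds.

    Hypothetical helper: the platform may reject out-of-range values
    instead of clamping them.
    """
    if requested_seconds is None:
        return DEFAULT_GRACE_SECONDS
    if requested_seconds <= 0:
        raise ValueError("grace period must be positive")
    return min(requested_seconds, MAX_GRACE_SECONDS)
```

For example, a requested value of 100000 seconds (about 27.8 hours) would be reduced to 43200 seconds under this assumption.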

Cold Start / Scale-to-Zero
Scope: app
Unit: instances
Limit: scales to zero when idle
Notes: Apps scale down to zero idle instances; cold-start latency applies on first request after idle.

Policies

Tiered Concurrency

Concurrency caps rise as accounts move from Hobby to Standard to Enterprise (unlimited GPU concurrency).

Autoscaling

Apps autoscale replicas between configured min and max based on incoming traffic.
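
To illustrate how the min/max bounds and a plan-tier cap interact, here is a minimal sketch of a replica-count calculation. The profile does not document the scaling signal Cerebrium actually uses, so the inputs `in_flight` and `per_replica_concurrency`, the `plan_cap` parameter, and the function name are all hypothetical.

```python
import math


def target_replicas(in_flight, per_replica_concurrency,
                    min_replicas, max_replicas, plan_cap=None):
    """Pick a replica count for the current load, clamped to the bounds.

    Assumption: load is modelled as in-flight requests divided by how many
    requests one replica handles at once. plan_cap=None models a tier with
    no concurrency cap.
    """
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    if min_replicas < 0 or max_replicas < min_replicas:
        raise ValueError("require 0 <= min_replicas <= max_replicas")

    needed = math.ceil(in_flight / per_replica_concurrency)
    # The plan-tier cap, when present, overrides the configured maximum.
    upper = max_replicas if plan_cap is None else min(max_replicas, plan_cap)
    # The floor can never exceed the effective ceiling.
    lower = min(min_replicas, upper)
    return max(lower, min(needed, upper))
```

With `min_replicas=0`, an idle app resolves to zero replicas, which matches the scale-to-zero behaviour described in the limits.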

Backoff Strategy

Clients should implement exponential backoff with jitter and honor Retry-After on any throttling responses.
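
The backoff guidance above can be sketched as follows. This is one possible client-side approach using only the Python standard library, not an official Cerebrium client. The `send` callable, its `(status, headers, body)` return shape, and the default `base`, `cap`, and `max_attempts` values are assumptions chosen for illustration.

```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Parse a Retry-After header (delay-seconds or HTTP-date) into seconds.

    Returns None when the header is absent or unparseable.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0,
                  rng=random.random):
    """Seconds to wait before the next try.

    A server-provided Retry-After wins; otherwise use exponential backoff
    with full jitter: a random value in [0, min(cap, base * 2**attempt)).
    """
    if retry_after is not None:
        return retry_after
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (status, headers, body),
    where headers is a dict keyed by "Retry-After".
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break
        hint = parse_retry_after(headers.get("Retry-After"))
        sleep(backoff_delay(attempt, hint))
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

Full jitter is used here so that many clients throttled at the same moment do not retry in lockstep; other jitter schemes would also satisfy the policy.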
