Cerebrium Rate Limits
Cerebrium does not throttle deployed function endpoints with classic per-minute request quotas; instead, throughput is governed by autoscaling GPU/CPU concurrency limits per app and per account plan tier. The number of concurrent replicas an app can scale to is configured per deployment (min/max instances) and capped by the plan tier, with Enterprise offering unlimited GPU concurrency. Async runs are bounded by a maximum execution window (up to 12 hours) and a configurable response grace period. Specific per-account concurrency caps are not reconciled in this artifact.
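As a rough illustration of how these limits combine, the sketch below computes the replica ceiling an app could reach and clamps an async execution window to the 12-hour maximum. It is not Cerebrium's API: the function names, the `plan_cap` parameter, and the example cap values are hypothetical, and since this profile does not list per-account caps, any real cap has to come from the account's plan.

```python
from typing import Optional

# 12-hour ceiling for async runs, as stated in the profile above.
MAX_ASYNC_WINDOW_SECONDS = 12 * 60 * 60


def effective_max_replicas(configured_max: int, plan_cap: Optional[int]) -> int:
    """Return the replica ceiling: the deployment's max, capped by the plan tier.

    plan_cap=None models a tier with no concurrency cap (for example Enterprise).
    Both arguments are hypothetical inputs, not fields of a real Cerebrium API.
    """
    if configured_max < 0:
        raise ValueError("configured_max must be non-negative")
    if plan_cap is None:
        return configured_max
    return min(configured_max, plan_cap)


def clamp_async_window(requested_seconds: int) -> int:
    """Clamp a requested async execution window to the 12-hour maximum."""
    if requested_seconds <= 0:
        raise ValueError("requested_seconds must be positive")
    return min(requested_seconds, MAX_ASYNC_WINDOW_SECONDS)
```

For example, an app configured for up to 10 instances on a plan assumed to allow 5 would be limited to 5, while the same app on an uncapped tier could reach all 10.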
Cerebrium Rate Limits is the machine-readable rate-limit profile for Cerebrium on the APIs.io network, conforming to the API Commons Rate Limits specification.
It captures 4 rate-limit definitions, measured in instances, gpu_instances, and seconds.
The profile also defines 3 backoff/retry policies and documents response codes for throttled requests.
Tagged areas include AI, GPU, Serverless, Inference, and ML Infrastructure.
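The backoff/retry policies themselves are not spelled out in this summary, so the following is only a generic sketch of capped exponential backoff that a client might use when a request is throttled. It is not one of the three policies in the profile. The `Throttled` exception is a hypothetical stand-in for whatever throttling signal the caller's own client code detects, and all parameter defaults are illustrative assumptions.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class Throttled(Exception):
    """Hypothetical marker raised by the caller's own code on a throttled response."""


def backoff_delays(base: float, factor: float, cap: float, count: int) -> Iterator[float]:
    """Yield `count` exponentially growing delays, each limited to `cap` seconds."""
    delay = base
    for _ in range(count):
        yield min(delay, cap)
        delay *= factor


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    jitter: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `attempts` times, sleeping between throttled attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(base, factor, cap, attempts - 1)
    for attempt in range(attempts):
        try:
            return fn()
        except Throttled:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the throttling error
            delay = next(delays)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, delay) if jitter else delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting, and turning `jitter` off makes the delay sequence deterministic.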