Inferless Rate Limits

Inferless throughput is governed by per-model autoscaling configuration rather than by a fixed account-wide requests-per-minute quota. Each deployed model has a minimum and maximum replica count, a container concurrency setting (requests served per replica), and an inference timeout. Effective peak capacity for a model is max_replica × container_concurrency; requests beyond the capacity of the replicas currently running either queue or wait for autoscaling to add replicas. Enterprise plans raise the account-level GPU concurrency ceiling (for example, to 50 concurrent GPUs). Specific per-account ceilings are not confirmed in this profile.
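The capacity arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the replica calculation assumes a simple "ceiling of in-flight requests over container concurrency" rule, which is not confirmed to be how the Inferless autoscaler decides.

```python
import math


def effective_capacity(max_replica: int, container_concurrency: int) -> int:
    """Peak simultaneous requests a model can serve at full scale-out."""
    return max_replica * container_concurrency


def replicas_needed(in_flight: int, container_concurrency: int,
                    min_replica: int, max_replica: int) -> int:
    """Replicas required for `in_flight` requests, clamped to [min, max].

    Assumption: one replica is needed per `container_concurrency` requests.
    """
    wanted = math.ceil(in_flight / container_concurrency)
    return max(min_replica, min(max_replica, wanted))


def overflow(in_flight: int, max_replica: int, container_concurrency: int) -> int:
    """Requests that would still have to queue at full scale-out."""
    return max(0, in_flight - effective_capacity(max_replica, container_concurrency))


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(effective_capacity(max_replica=4, container_concurrency=8))
    print(replicas_needed(in_flight=20, container_concurrency=8,
                          min_replica=0, max_replica=4))
    print(overflow(in_flight=40, max_replica=4, container_concurrency=8))
```

With min_replica set to 0, an idle model needs zero replicas under this rule, so the first request after an idle period would wait for a replica to start.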

Inferless Rate Limits is the machine-readable rate-limit profile for Inferless on the APIs.io network, conforming to the API Commons Rate Limits specification.

It captures 5 rate-limit definitions, measured in replicas, concurrent_requests, concurrent_gpus, and seconds.

The profile also defines 2 backoff/retry policies and documents the response code returned for throttled requests (429).

Tagged areas include AI, ML Inference, Serverless GPU, Model Deployment, and Inference.

5 Limits · Throttle: 429
Tags: AI, ML Inference, Serverless GPU, Model Deployment, Inference, Rate Limiting, Quotas, Throttling

Limits

Max Replicas (scope: model)
Unit: replicas
Value: configurable per model (max_replica)
Upper bound on concurrently running GPU replicas for a deployed model.

Min Replicas (scope: model)
Unit: replicas
Value: configurable per model (min_replica, may be 0)
Setting min_replica to 0 enables scale-to-zero; no compute charge when idle.

Container Concurrency (scope: model)
Unit: concurrent_requests
Value: configurable per model (container_concurrency)
Number of simultaneous requests served per replica before scaling out.

GPU Concurrency (scope: account)
Unit: concurrent_gpus
Value: plan-dependent (e.g. 50 on Enterprise)
Account-level ceiling on concurrent GPU instances across deployments.

Inference Timeout (scope: model)
Unit: seconds
Value: configurable per model (inference_time)
Maximum duration a single inference request may run before timing out.
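As a rough illustration of how these five limits relate, the sketch below models the per-model settings as a Python dataclass and checks a set of deployments against the account-level GPU ceiling. The class, the function, and the validation rules are hypothetical and are not part of any Inferless SDK; only the field names (min_replica, max_replica, container_concurrency, inference_time) come from this profile. It also assumes each replica occupies one GPU, which may not hold for every deployment.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    """Per-model settings named in this profile (hypothetical container)."""
    min_replica: int
    max_replica: int
    container_concurrency: int
    inference_time: int  # seconds

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means consistent."""
        problems = []
        if self.min_replica < 0:
            problems.append("min_replica must be >= 0")
        if self.max_replica < max(1, self.min_replica):
            problems.append("max_replica must be >= 1 and >= min_replica")
        if self.container_concurrency < 1:
            problems.append("container_concurrency must be >= 1")
        if self.inference_time <= 0:
            problems.append("inference_time must be a positive number of seconds")
        return problems


def fits_account(models: list[ModelLimits], gpu_concurrency: int,
                 gpus_per_replica: int = 1) -> bool:
    """True if every model could reach max_replica at once under the ceiling.

    Assumption: each replica uses `gpus_per_replica` GPUs (default 1).
    """
    needed = sum(m.max_replica * gpus_per_replica for m in models)
    return needed <= gpu_concurrency
```

A check like fits_account is deliberately pessimistic: it asks whether all models could hit their maximum at the same moment, which matters only if their traffic peaks coincide.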

Policies

Autoscaling
Capacity scales between min and max replicas based on live traffic; scale-down occurs after a configurable scale_down_delay.
Backoff Strategy
Clients should implement exponential backoff with jitter and honor Retry-After on 429 responses while replicas scale up.
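A minimal client-side sketch of the backoff policy follows, using only the Python standard library. The `send` callable is a hypothetical stand-in for whatever HTTP call the client makes; it is assumed to return a status code and a headers dictionary keyed by "Retry-After". The base delay, cap, and attempt count are illustrative defaults, not values published by Inferless, and the "full jitter" variant is one common choice among several.

```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Seconds to wait per a Retry-After header, or None if unusable.

    Accepts both forms allowed by HTTP: delta-seconds and an HTTP-date.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(send, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers_dict).
    A usable Retry-After value takes precedence over the computed delay.
    """
    status, headers = None, {}
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            break
        if attempt == max_attempts - 1:
            break  # out of attempts; return the last 429 to the caller
        hinted = parse_retry_after(headers.get("Retry-After"))
        delay = hinted if hinted is not None else backoff_delay(attempt, base, cap, rng)
        sleep(delay)
    return status, headers
```

Passing `sleep` and `rng` as parameters keeps the retry loop testable without real waiting; production callers can leave both at their defaults.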