Inferless Rate Limits

Name: Inferless Rate Limits
Creator: Inferless
Keywords: AI, ML Inference, Serverless GPU, Model Deployment, Inference, Rate Limiting, Quotas, Throttling

Inferless governs throughput through per-model autoscaling configuration rather than a fixed account-wide requests-per-minute quota. Each deployed model has a minimum and maximum replica count, a container concurrency setting (requests served per replica), and an inference timeout. Effective peak capacity is max_replica × container_concurrency concurrent requests; requests beyond live capacity queue or wait for autoscaling. GPU concurrency is capped per account and depends on the plan (e.g. 50 concurrent GPUs on Enterprise). Specific per-account ceilings are not reconciled in this artifact.
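The capacity arithmetic above can be sketched in a few lines of Python. This is an illustration only: the `ModelScaling` class and both helper functions are hypothetical names introduced here, not part of any Inferless SDK, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScaling:
    """Hypothetical container for the per-model settings named in this profile."""
    min_replica: int
    max_replica: int
    container_concurrency: int


def peak_capacity(cfg: ModelScaling) -> int:
    """Concurrent requests served when every allowed replica is running."""
    return cfg.max_replica * cfg.container_concurrency


def overflow(cfg: ModelScaling, in_flight: int) -> int:
    """Requests beyond peak capacity; per the profile, these queue or wait."""
    return max(0, in_flight - peak_capacity(cfg))


# Illustrative values, not Inferless defaults.
cfg = ModelScaling(min_replica=0, max_replica=4, container_concurrency=8)
print(peak_capacity(cfg))  # 32
print(overflow(cfg, 40))   # 8
```

With 4 replicas serving 8 requests each, a burst of 40 simultaneous requests leaves 8 waiting even after the model has fully scaled out.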

Inferless Rate Limits is the machine-readable rate-limit profile for Inferless on the APIs.io network, conforming to the API Commons Rate Limits specification.

It captures 5 rate-limit definitions, measuring replicas, concurrent_requests, concurrent_gpus, and seconds.

The profile also defines 2 policies (autoscaling and backoff/retry) and documents 429 as the response code for throttled requests.

Tagged areas include AI, ML Inference, Serverless GPU, Model Deployment, and Inference.

5 Limits · Throttle: 429

Limits

Max Replicas
Scope: model
Metric: replicas
Limit: configurable per model (max_replica)
Upper bound on concurrently running GPU replicas for a deployed model.

Min Replicas
Scope: model
Metric: replicas
Limit: configurable per model (min_replica, may be 0)
Setting min_replica to 0 enables scale-to-zero, so no compute is charged while the model is idle.

Container Concurrency
Scope: model
Metric: concurrent_requests
Limit: configurable per model (container_concurrency)
Number of simultaneous requests served per replica before scaling out.

GPU Concurrency
Scope: account
Metric: concurrent_gpus
Limit: plan-dependent (e.g. 50 on Enterprise)
Account-level ceiling on concurrent GPU instances across deployments.

Inference Timeout
Scope: model
Metric: seconds
Limit: configurable per model (inference_time)
Maximum duration a single inference request may run before timing out.
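To see how the per-model and account-level limits interact, the following Python sketch estimates how many replicas a given load would call for and whether a set of deployments could exceed the account GPU ceiling. Both function names are hypothetical, the model names are invented, and the headroom check assumes each replica occupies exactly one GPU, which this profile does not state.

```python
import math


def replicas_needed(in_flight: int, container_concurrency: int,
                    min_replica: int, max_replica: int) -> int:
    """Replicas required for `in_flight` requests, clamped to the model's bounds."""
    if container_concurrency < 1:
        raise ValueError("container_concurrency must be at least 1")
    wanted = math.ceil(in_flight / container_concurrency)
    return max(min_replica, min(max_replica, wanted))


def gpu_headroom(max_replicas_by_model: dict[str, int], gpu_concurrency: int) -> int:
    """Account ceiling minus worst-case demand.

    ASSUMPTION: one GPU per replica. A negative result means the deployments
    could contend for GPUs if all of them scaled to max_replica at once.
    """
    return gpu_concurrency - sum(max_replicas_by_model.values())


print(replicas_needed(9, container_concurrency=8, min_replica=0, max_replica=4))  # 2

# Hypothetical deployments checked against the Enterprise example ceiling of 50.
deployments = {"llm": 30, "embedder": 15, "whisper": 10}
print(gpu_headroom(deployments, gpu_concurrency=50))  # -5
```

A negative headroom does not mean requests fail outright; it only flags that the configured max_replica values add up to more than the account can run concurrently.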

Policies

Autoscaling

Capacity scales between min and max replicas based on live traffic; scale-down occurs after a configurable scale_down_delay.
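The scale-down delay can be pictured with a toy simulation. This sketch is not Inferless's autoscaler: the `AutoscalerSketch` class is hypothetical, it assumes scale-up is immediate, and it assumes scale-down happens only after demand has stayed below the current size for scale_down_delay seconds.

```python
import math
from typing import Optional


class AutoscalerSketch:
    """Toy model: scale up at once, scale down only after a quiet period."""

    def __init__(self, min_replica: int, max_replica: int,
                 container_concurrency: int, scale_down_delay: float) -> None:
        self.min_replica = min_replica
        self.max_replica = max_replica
        self.container_concurrency = container_concurrency
        self.scale_down_delay = scale_down_delay
        self.replicas = min_replica
        # Time at which demand first dropped below the current replica count.
        self._low_since: Optional[float] = None

    def observe(self, now: float, in_flight: int) -> int:
        """Record the load at time `now` (seconds) and return the replica count."""
        wanted = math.ceil(in_flight / self.container_concurrency)
        wanted = max(self.min_replica, min(self.max_replica, wanted))
        if wanted >= self.replicas:
            self.replicas = wanted
            self._low_since = None
        elif self._low_since is None:
            self._low_since = now  # start the scale-down timer
        elif now - self._low_since >= self.scale_down_delay:
            self.replicas = wanted
            self._low_since = None
        return self.replicas


scaler = AutoscalerSketch(min_replica=0, max_replica=4,
                          container_concurrency=8, scale_down_delay=60)
print(scaler.observe(0, 20))   # 3 (burst arrives, scale up)
print(scaler.observe(10, 0))   # 3 (idle, timer starts)
print(scaler.observe(40, 0))   # 3 (still inside the delay)
print(scaler.observe(70, 0))   # 0 (idle for 60 s, scale to zero)
```

The delay keeps replicas warm through short gaps in traffic, trading some idle compute for fewer cold starts.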

Backoff Strategy

Clients should implement exponential backoff with jitter and honor Retry-After on 429 responses while replicas scale up.
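A minimal client-side sketch of that policy follows. It is transport-agnostic: `send` is a hypothetical callable that performs one request and returns a (status, headers, body) tuple, and the default delays and attempt count are illustrative choices rather than values published by Inferless. Only the delta-seconds form of Retry-After is parsed here.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Parse Retry-After given as seconds; the HTTP-date form is ignored here.

    ASSUMPTION: the caller passes headers keyed exactly as "Retry-After".
    """
    value = headers.get("Retry-After")
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        return None


def call_with_backoff(send: Callable[[], Response],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> Response:
    """Retry `send` on 429, preferring Retry-After over computed backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # out of attempts; return the last 429 to the caller
        hinted = retry_after_seconds(headers)
        if hinted is not None:
            delay = hinted
        else:
            # Full jitter: uniform in [0, min(max_delay, base_delay * 2**attempt)).
            delay = rng() * min(max_delay, base_delay * 2 ** attempt)
        sleep(delay)
    return status, headers, body
```

Injecting `sleep` and `rng` keeps the function testable without real waiting; in production code the defaults can be left as they are.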
