Inferless Rate Limits
Inferless throughput is governed by per-model autoscaling configuration rather than by a fixed account-wide requests-per-minute quota. Each deployed model has a minimum and maximum replica count, a container concurrency setting (requests served per replica), and an inference timeout. Peak capacity is max_replica × container_concurrency; requests that exceed the capacity of the replicas currently running are queued or wait for autoscaling to add replicas. Enterprise plans raise the GPU concurrency limit (for example, to 50 concurrent GPUs). This profile does not reconcile specific per-account ceilings.
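The capacity arithmetic above can be sketched in a few lines of Python. This is only an illustration: the class `ModelScaling`, its methods, and the helper `excess_requests` are hypothetical names chosen for this example and are not part of any Inferless SDK or API. Only the three settings they model come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScaling:
    """Hypothetical holder for the per-model settings described above."""
    min_replica: int
    max_replica: int
    container_concurrency: int  # requests served per replica

    def peak_capacity(self) -> int:
        # Upper bound on concurrent requests once fully scaled out.
        return self.max_replica * self.container_concurrency

    def live_capacity(self, live_replicas: int) -> int:
        # Capacity of the replicas running right now, capped at max_replica.
        replicas = max(0, min(live_replicas, self.max_replica))
        return replicas * self.container_concurrency


def excess_requests(cfg: ModelScaling, live_replicas: int, in_flight: int) -> int:
    # Requests beyond live capacity; these would queue or wait for autoscaling.
    return max(0, in_flight - cfg.live_capacity(live_replicas))


cfg = ModelScaling(min_replica=1, max_replica=4, container_concurrency=8)
print(cfg.peak_capacity())          # 4 replicas x 8 requests each
print(excess_requests(cfg, 2, 20))  # 20 in flight against 2 live replicas
```

With these example values, peak capacity is 32 concurrent requests, and 20 in-flight requests against 2 live replicas leave 4 requests waiting.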
Inferless Rate Limits is the machine-readable rate-limit profile for Inferless on the APIs.io network, conforming to the API Commons Rate Limits specification.
It captures 5 rate-limit definitions, measured in replicas, concurrent_requests, concurrent_gpus, and seconds.
The profile also defines 2 backoff/retry policies and documents the response codes returned for throttled requests.
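This summary does not spell out the backoff/retry policies or the throttling response codes, so the following is only a generic client-side sketch of exponential backoff with jitter, not the policy the profile defines. The retryable status codes (429 and 503), the delay parameters, and the names `backoff_delay` and `call_with_retry` are all assumptions made for illustration.

```python
import random
import time
from typing import Callable

# Assumption: status codes treated as "throttled"; not taken from the profile.
RETRYABLE_STATUSES = frozenset({429, 503})


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # Exponential growth (base * 2^attempt), capped at `cap` seconds.
    return min(cap, base * (2 ** attempt))


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call `send` until it returns a non-throttled status or attempts run out.

    `send` is a hypothetical callable that performs one request and returns
    its HTTP status code.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = send()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE_STATUSES:
            break
        # Full jitter: wait a random fraction of the capped exponential delay.
        sleep(rng() * backoff_delay(attempt))
        status = send()
    return status
```

In practice the total time spent retrying should stay within the model's inference timeout, since a request that waits too long for autoscaling may fail regardless of how the client retries.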
Tagged areas include AI, ML Inference, Serverless GPU, Model Deployment, and Inference.