NVIDIA NIM · Rate Limits

NVIDIA NIM Rate Limits is the machine-readable rate-limit profile for NVIDIA NIM on the APIs.io network, conforming to the API Commons Rate Limits specification.

The profile also defines 7 backoff/retry policies.

Tagged areas include AI, Artificial Intelligence, Inference, Microservices, and LLM.

0 Limits
Tags: AI, Artificial Intelligence, Inference, Microservices, LLM, Foundation Models, GPU, Kubernetes, NVIDIA, OpenAI Compatible

Policies

Hosted Developer RPM
Free developer tier soft rate limit on /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/ranking.
Hosted Developer Credits
1,000 free inference credits granted on NVIDIA Developer Program signup, consumed across all hosted models.
Hosted Concurrent Requests
Cap on concurrent in-flight requests against the shared hosted endpoint.
SSE Keepalive
Streaming connections idle longer than ~60s may be closed by the gateway.
Per-request Output Cap
Per-request output token cap on the hosted endpoint (model-dependent; some models support a higher cap).
Hosted Input Size
Maximum input tokens on the hosted endpoint (model-dependent; long-context Llama 3.1/3.3 and Nemotron variants accept up to ~128K tokens).
Self-hosted Container Concurrency
Self-hosted NIM containers accept as many concurrent requests as the GPU and TensorRT-LLM batching configuration allow. Operators tune this via NIM_MAX_BATCH_SIZE and similar environment variables.
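
Client-side handling sketch

The profile publishes no numeric limits, so a client calling the hosted endpoint may want to cap its own in-flight requests and back off when it is throttled. The Python sketch below shows one way to do that under stated assumptions: that throttling surfaces as an HTTP 429 response, possibly with a Retry-After hint, and that the caller translates it into the hypothetical RateLimited exception. The names RateLimited, backoff_delay, and ThrottledClient, and the default values for max_in_flight, base, and cap, are illustrative placeholders and are not taken from NVIDIA documentation.

```python
import random
import threading
import time


class RateLimited(Exception):
    """Hypothetical error the caller raises when the endpoint answers HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the server sent a hint


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


class ThrottledClient:
    """Caps in-flight calls and retries the ones that were rate limited."""

    def __init__(self, max_in_flight=4, max_retries=5, sleep=time.sleep):
        # max_in_flight is a placeholder; the profile does not state a number.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._max_retries = max_retries
        self._sleep = sleep

    def call(self, send):
        """Run send(), any zero-argument callable, under the concurrency cap."""
        for attempt in range(self._max_retries + 1):
            try:
                with self._slots:  # slot is released before any sleep below
                    return send()
            except RateLimited as err:
                if attempt == self._max_retries:
                    raise
                # Prefer the server's hint when present, else jittered backoff.
                if err.retry_after is not None:
                    delay = err.retry_after
                else:
                    delay = backoff_delay(attempt)
                self._sleep(delay)
```

In use, send would wrap the actual HTTP request to an endpoint such as /v1/chat/completions and raise RateLimited on a throttled response. Releasing the semaphore slot before sleeping keeps a waiting request from blocking other callers, and the jitter spreads retries out so that several clients throttled at the same moment do not all retry together.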