NVIDIA NIM · Rate Limits

Name: Nvidia Nim Rate Limits
Creator: NVIDIA NIM
Keywords: AI, Artificial Intelligence, Inference, Microservices, LLM, Foundation Models, GPU, Kubernetes, NVIDIA, OpenAI Compatible

Nvidia Nim Rate Limits is the machine-readable rate-limit profile for NVIDIA NIM on the APIs.io network, conforming to the API Commons Rate Limits specification.

The profile defines 7 backoff/retry policies, listed under Policies below.

Tagged areas include AI, Artificial Intelligence, Inference, Microservices, and LLM.

0 Limits

Policies

Hosted Developer RPM

Free developer-tier soft rate limit, in requests per minute (RPM), on /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/ranking.
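
A client that hits a soft request-rate limit usually retries with exponential backoff. The sketch below is one way to do that, not NVIDIA's documented behavior: it assumes the gateway signals throttling with the standard HTTP 429 status and may send a Retry-After header in seconds. The call_with_backoff helper and its send callback are hypothetical names introduced for illustration.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, response body)
Response = Tuple[int, Mapping[str, str], str]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        # Return on success, on any non-throttling error, or on the last try.
        if status != 429 or attempt == max_attempts - 1:
            return status, headers, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # Assumption: the server hint is in whole seconds.
            delay = min(float(retry_after), max_delay)
        else:
            # Full jitter: uniform in [0, min(max_delay, base * 2**attempt)].
            delay = random.uniform(0.0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
    raise AssertionError("unreachable")
```

In practice send would wrap one HTTP POST to an endpoint such as /v1/chat/completions. Passing sleep as a parameter keeps the helper easy to exercise without real waiting.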

Hosted Developer Credits

1,000 free inference credits granted on NVIDIA Developer Program signup, consumed across all hosted models.

Hosted Concurrent Requests

Concurrent in-flight requests against the shared hosted endpoint.
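
One way to stay under a concurrent-request cap is to bound in-flight calls on the client side. The sketch below uses asyncio.Semaphore for this. The run_bounded helper is a hypothetical name, and max_in_flight is a value the caller must choose, since this profile does not state the actual cap.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def run_bounded(
    jobs: Iterable[Callable[[], Awaitable[T]]],
    max_in_flight: int,
) -> List[T]:
    """Run all jobs, never letting more than max_in_flight execute at once."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        # Each job waits here until a slot is free.
        async with gate:
            return await job()

    # gather preserves the order of the input jobs in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each job would be a zero-argument coroutine function that issues one request. Results come back in the same order as the jobs were supplied.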

SSE Keepalive

Streaming connections idle longer than ~60s may be closed by the gateway.
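
Because the gateway may drop an idle stream, a client benefits from telling a finished stream apart from one that was cut off. The sketch below parses server-sent event lines, skips comment lines (lines starting with a colon, which the SSE format allows and which servers often use as keepalives), and reports whether a terminating sentinel was seen. It assumes the OpenAI-style "data: [DONE]" sentinel and one payload per data line. Neither is confirmed by this profile, and collect_sse_data is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def collect_sse_data(lines: Iterable[str]) -> Tuple[List[str], bool]:
    """Return (payloads, completed) for a stream of raw SSE lines.

    completed is False when the stream ended before the sentinel arrived,
    for example because the connection was closed while idle.
    """
    payloads: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # blank event separator or SSE comment/keepalive
        if line.startswith("data:"):
            data = line[len("data:"):]
            if data.startswith(" "):
                data = data[1:]  # SSE strips a single leading space
            if data == "[DONE]":  # assumed OpenAI-style end marker
                return payloads, True
            payloads.append(data)
    return payloads, False
```

When completed comes back False, the caller can decide whether to retry the request or surface the partial output.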

Per-request Output Cap

Per-request output token cap on the hosted endpoint (model-dependent; some models support a higher cap).
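
A request builder can clamp the requested output length to the cap before sending. The sketch below assumes an OpenAI-compatible chat payload with a max_tokens field. The build_chat_request helper is hypothetical, and output_cap must come from the documentation of the specific model, since this profile gives no number.

```python
from typing import Any, Dict, List


def build_chat_request(
    model: str,
    messages: List[Dict[str, str]],
    requested_max_tokens: int,
    output_cap: int,
) -> Dict[str, Any]:
    """Build a chat payload whose max_tokens never exceeds output_cap."""
    if requested_max_tokens < 1 or output_cap < 1:
        raise ValueError("token counts must be positive")
    return {
        "model": model,
        "messages": messages,
        # Clamp on the client so the request stays within the cap.
        "max_tokens": min(requested_max_tokens, output_cap),
    }
```

Clamping locally keeps the behavior predictable regardless of whether the server would reject or silently truncate an oversized request.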

Hosted Input Size

Maximum input tokens on the hosted endpoint (model-dependent; long-context Llama 3.1/3.3 and Nemotron variants accept up to ~128K tokens).
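
To keep a long conversation under an input limit, a client can drop the oldest turns before sending. The sketch below is one simple policy under stated assumptions: the caller supplies count_tokens (a real tokenizer for the target model is needed for accurate counts), per-message formatting overhead is ignored, and system messages are always kept. trim_to_budget is a hypothetical helper name.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}


def trim_to_budget(
    messages: List[Message],
    budget: int,
    count_tokens: Callable[[str], int],
) -> List[Message]:
    """Drop the oldest non-system messages until the total fits the budget."""
    kept = list(messages)  # work on a copy; the input list is not modified

    def total() -> int:
        return sum(count_tokens(m["content"]) for m in kept)

    while total() > budget:
        droppable = [i for i, m in enumerate(kept) if m["role"] != "system"]
        # Always keep the most recent non-system message.
        if len(droppable) <= 1:
            raise ValueError("prompt exceeds the input budget even after trimming")
        del kept[droppable[0]]
    return kept
```

The budget passed in would typically be the model's input limit minus some headroom for the uncounted overhead.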

Self-hosted Container Concurrency

Self-hosted NIM containers accept as many concurrent requests as the GPU and the TensorRT-LLM batching configuration allow. Operators tune this via NIM_MAX_BATCH_SIZE and similar environment variables.