Ollama Rate Limits

Name: Ollama Rate Limits
Creator: Ollama
Keywords: Artificial Intelligence, Large Language Models, Models, Rate Limiting

Local Ollama (http://localhost:11434) has no rate limits and requires no authentication. Ollama Cloud enforces tier-based concurrency (Free: basic limits with no stated number; Pro: 3 concurrent cloud models; Max: 10) and weekly GPU-time quotas rather than per-second request ceilings. Cloud quotas reset on two cycles: a 5-hour session window and a 7-day weekly window. Specific TPS / RPM ceilings are not publicly documented.

Ollama Rate Limits is the machine-readable rate-limit profile for Ollama on the APIs.io network, conforming to the API Commons Rate Limits specification.

It captures 5 rate-limit definitions, measuring requests_per_second, concurrent_requests, and gpu_time.

The profile also defines 4 policies (authentication, backoff, privacy, and hybrid scheduling) and documents response codes for the unauthorized, throttled, and serviceUnavailable conditions.

Tagged areas include Artificial Intelligence, Large Language Models, Models, and Rate Limiting.

Limits: 5 · Throttle status code: 429

Limits

Local server · scope: localhost

Metric: requests_per_second

Limit: unlimited (bounded by local hardware)

Note: No authentication and no rate limiting on http://localhost:11434.
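Because the local server needs no credentials, a request to it carries no Authorization header. The following is a minimal sketch, assuming a server is running at the address above and that the model named by the caller has already been pulled; the helper names are illustrative and not part of the profile.

```python
import json
import urllib.request

LOCAL_BASE_URL = "http://localhost:11434"  # address given in the profile


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to /api/generate with no authentication headers."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{LOCAL_BASE_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a running local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Throughput here is limited only by the hardware running the server, so any pacing is a client-side choice rather than something the server enforces.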

Cloud: concurrent models (Free) · scope: account

Metric: concurrent_requests

Note: Free tier permits running cloud models with basic limits.

Cloud: concurrent models (Pro) · scope: account

Metric: concurrent_requests

Note: Pro tier permits 3 cloud models at a time.

Cloud: concurrent models (Max) · scope: account

Metric: concurrent_requests

Note: Max tier permits 10 cloud models at a time.
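A client can stay under the tier caps by gating its own in-flight calls. The sketch below is one possible approach, not something the profile prescribes. It assumes the cap can be treated as a limit on simultaneous requests (the profile's metric is concurrent_requests, although the prose speaks of concurrent models), and it omits the Free tier because no number is published for it. ConcurrencyGate and run_all are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Concurrency caps stated in the profile; Free has no published number.
TIER_CONCURRENCY = {"pro": 3, "max": 10}


class ConcurrencyGate:
    """Allow at most `limit` callers inside the gate at the same time."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, kept for inspection

    def run(self, fn, *args):
        with self._slots:  # blocks while `limit` calls are already in flight
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._active -= 1


def run_all(gate: ConcurrencyGate, fn, items, workers: int = 16) -> list:
    """Apply `fn` to every item on a thread pool, never exceeding the gate's limit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: gate.run(fn, item), items))
```

For example, `run_all(ConcurrencyGate(TIER_CONCURRENCY["pro"]), send_prompt, prompts)` would keep at most 3 calls to a caller-supplied `send_prompt` function in flight, however many worker threads the pool has.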

Cloud: weekly GPU usage · scope: account

Metric: gpu_time per week

Limit: tier-dependent (Free baseline, Pro = 50x Free, Max = 5x Pro)

Note: Quotas reset on a 5-hour session window and a 7-day weekly window.
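The multipliers chain together, so Max works out to 5 x 50 = 250 times the Free baseline. The sketch below only works in relative units, since the profile does not publish the absolute size of the Free allowance; share_used is a hypothetical helper for illustration.

```python
# Weekly GPU-time allowance per tier, in multiples of the Free baseline.
# The absolute size of the baseline is not published, so it stays symbolic.
PRO_VS_FREE = 50
MAX_VS_PRO = 5

RELATIVE_QUOTA = {
    "free": 1,
    "pro": PRO_VS_FREE,
    "max": PRO_VS_FREE * MAX_VS_PRO,  # 250x the Free baseline
}


def share_used(tier: str, used_free_units: float) -> float:
    """Fraction of a tier's weekly allowance consumed.

    `used_free_units` is usage expressed in multiples of the Free baseline.
    """
    return used_free_units / RELATIVE_QUOTA[tier]
```

Under these figures, usage equal to 25 Free baselines is half of a Pro allowance and one tenth of a Max allowance.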

Policies

Authentication

The local server requires no authentication. Cloud access requires an ollama.com account; supply the API key through the OLLAMA_API_KEY environment variable or send it in an Authorization: Bearer header.
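As an illustration of the header form, the sketch below reads the key from OLLAMA_API_KEY and attaches it as a Bearer token. The base URL https://ollama.com and the helper names are assumptions made for this example; the profile names only the account domain, the environment variable, and the header.

```python
import os
import urllib.request

CLOUD_BASE_URL = "https://ollama.com"  # assumption: cloud API host


def auth_headers(env=os.environ) -> dict:
    """Return the Authorization header built from OLLAMA_API_KEY."""
    key = env.get("OLLAMA_API_KEY")
    if not key:
        raise RuntimeError("OLLAMA_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}


def build_cloud_request(path: str, body: bytes, env=os.environ) -> urllib.request.Request:
    """Build an authenticated POST for the cloud host; nothing is sent here."""
    headers = {"Content-Type": "application/json", **auth_headers(env)}
    return urllib.request.Request(
        f"{CLOUD_BASE_URL}{path}", data=body, headers=headers, method="POST"
    )
```

Failing fast when the variable is missing keeps an unauthenticated request from reaching the service and coming back as an unauthorized response.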

Backoff

On 429 from cloud, back off and retry; honor Retry-After if present.
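One way to implement this policy is to prefer the server's Retry-After value and fall back to exponential backoff with jitter when the header is absent. The sketch below assumes Retry-After arrives as a number of seconds (the HTTP-date form is not parsed), and the base delay, cap, and attempt count are illustrative defaults rather than values from the profile. The `send` callable is a stand-in for whatever HTTP client is in use.

```python
import random
import time
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str],
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before the next try; `attempt` is 0 for the first retry."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # honor the server's hint
        except ValueError:
            pass  # HTTP-date form is not handled here; fall back to backoff
    # Exponential backoff with full jitter, bounded by `cap`.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_backoff(send, max_attempts: int = 5, sleep=time.sleep):
    """Call `send()` until it returns a status other than 429.

    `send` must return a (status, headers, body) tuple, where `headers`
    is a dict. The last (status, body) pair is returned either way.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            break
        if attempt < max_attempts - 1:
            sleep(retry_delay(attempt, headers.get("Retry-After")))
    return status, body
```

Passing `sleep` as a parameter keeps the retry loop easy to exercise without real delays.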

Privacy

Ollama states that cloud prompt and response data is never logged or used for training. Cloud inference runs primarily in US data centers, with routing to the EU or Singapore for capacity.

Hybrid scheduling

Cloud models can be invoked transparently from a local Ollama instance via signed-in cloud access; routing to localhost vs ollama.com is selected per-request by model name.
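Since the backend is chosen by model name, a client can predict which set of limits applies to a request before sending it. The sketch below assumes cloud-hosted models are identified by a "-cloud" suffix in the model tag; the profile does not state the naming rule, so treat the suffix and both function names as assumptions.

```python
CLOUD_SUFFIX = "-cloud"  # assumption: naming rule is not stated in the profile


def backend_for(model: str) -> str:
    """Return "cloud" or "local" according to the assumed naming rule."""
    return "cloud" if model.endswith(CLOUD_SUFFIX) else "local"


def applicable_limits(model: str) -> str:
    """Describe which limits from the profile would apply to this model."""
    if backend_for(model) == "cloud":
        return "tier concurrency cap and weekly GPU-time quota"
    return "none (bounded by local hardware)"
```

In either case the client still talks to the local instance; the distinction matters only for deciding whether concurrency gating and 429 handling are needed for that request.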
