Together AI Rate Limits

Name: Together AI Rate Limits
Creator: Together AI
Keywords: AI, LLM, Inference, Open Source, Fine-tuning, Rate Limiting, Quotas, Throttling

Together AI enforces per-account rate limits on serverless inference that vary by model and account tier (Build, Scale, or Enterprise; tiers advance as account spend or credit grows). Limits include requests per minute (RPM) and tokens per minute (TPM) per model. This artifact does not yet reconcile specific per-model values; see the Together console for the limits active on your account.

Together AI Rate Limits is the machine-readable rate-limit profile for Together AI on the APIs.io network, conforming to the API Commons Rate Limits specification.

It captures 5 rate-limit definitions, measuring requests, tokens, and jobs.

The profile also defines 2 policies (tiered limits and a backoff/retry strategy) and documents the response code returned for throttled requests (429).

Tagged areas include AI, LLM, Inference, Open Source, and Fine-tuning.
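As a rough illustration of what "machine-readable" means here, the five definitions listed under Limits can be held in a small data structure. This is a minimal sketch: the class name `LimitDefinition` and its field names are hypothetical and are not taken from the API Commons Rate Limits schema. Only the limit names, scopes, and units come from this profile.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitDefinition:
    """Hypothetical record for one limit; field names are illustrative only."""
    name: str
    scope: str  # "account" or "endpoint", as listed in this profile
    unit: str   # "requests", "tokens", or "jobs"
    value: str  # free text, since the profile gives no numeric values


PROVIDER_DOCS = "see provider documentation"

LIMITS = [
    LimitDefinition("Requests Per Minute (RPM)", "account", "requests", PROVIDER_DOCS),
    LimitDefinition("Tokens Per Minute (TPM)", "account", "tokens", PROVIDER_DOCS),
    LimitDefinition("Concurrent Fine-Tuning Jobs", "account", "jobs", PROVIDER_DOCS),
    LimitDefinition("Batch Job Size / Concurrency", "account", "jobs", PROVIDER_DOCS),
    LimitDefinition("Dedicated Endpoints", "endpoint", "requests",
                    "bounded by provisioned GPU capacity"),
]

# Count how many definitions measure each unit.
UNITS = Counter(limit.unit for limit in LIMITS)
```

Grouping by unit this way reproduces the summary above: five definitions that measure requests, tokens, and jobs.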

Limits: 5
Throttle: 429

Limits

Requests Per Minute (RPM)
Scope: account
Unit: requests
Value: see provider documentation
Notes: Per-model RPM, varies by tier and model. Pending reconciliation.

Tokens Per Minute (TPM)
Scope: account
Unit: tokens
Value: see provider documentation
Notes: Per-model TPM, varies by tier and model. Pending reconciliation.

Concurrent Fine-Tuning Jobs
Scope: account
Unit: jobs
Value: see provider documentation
Notes: Concurrency cap on parallel fine-tuning jobs.

Batch Job Size / Concurrency
Scope: account
Unit: jobs
Value: see provider documentation
Notes: Batch jobs are queued and do not consume serverless RPM/TPM directly.

Dedicated Endpoints
Scope: endpoint
Unit: requests
Value: bounded by provisioned GPU capacity
Notes: Throughput is determined by the dedicated hardware sizing.
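Because RPM and TPM are enforced together, a client can avoid many 429 responses by tracking both budgets locally before sending. The following is a minimal sketch of a trailing 60-second window limiter. The class `MinuteWindowLimiter` is hypothetical, the `rpm` and `tpm` values must come from your own account's limits (this profile lists none), and the server's actual accounting window may differ from the one assumed here.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class MinuteWindowLimiter:
    """Tracks requests and tokens sent during the trailing 60 seconds."""

    def __init__(self, rpm: int, tpm: int,
                 clock: Callable[[], float] = time.monotonic):
        if rpm < 1 or tpm < 1:
            raise ValueError("rpm and tpm must be positive")
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.tokens_in_window = 0

    def _evict(self, now: float) -> None:
        # Drop events that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60.0:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def wait_time(self, tokens: int) -> float:
        """Seconds to wait before a request of `tokens` tokens fits both budgets."""
        if tokens > self.tpm:
            raise ValueError("request exceeds the per-minute token budget")
        now = self.clock()
        self._evict(now)
        request_count = len(self.events)
        token_count = self.tokens_in_window
        wait = 0.0
        # Walk the oldest events until expiring them frees enough room.
        for timestamp, used in self.events:
            if request_count < self.rpm and token_count + tokens <= self.tpm:
                break
            wait = timestamp + 60.0 - now
            request_count -= 1
            token_count -= used
        return max(0.0, wait)

    def record(self, tokens: int) -> None:
        """Register a request that was just sent."""
        self.events.append((self.clock(), tokens))
        self.tokens_in_window += tokens
```

A caller would sleep for `wait_time(estimated_tokens)` seconds, send the request, and then call `record(estimated_tokens)`. The injectable `clock` is a design choice that keeps the limiter testable without real delays.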

Policies

Tiered Limits

Limits increase automatically as account spend or credit balance grows, and can be raised further through Enterprise agreements.

Backoff Strategy

Clients should implement exponential backoff with jitter and honor any Retry-After header.
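The backoff policy can be sketched as follows. This is one possible implementation, not Together AI's own client code: it uses the "full jitter" variant of exponential backoff, and the `send` callable, the `(status, headers, body)` response tuple, and the default `base`, `cap`, and `max_retries` values are all assumptions made for illustration. Only the numeric (seconds) form of Retry-After is parsed.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical response shape for this sketch: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent or not numeric."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                return None  # the HTTP-date form is not handled in this sketch
    return None


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(
    send: Callable[[], Response],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry on HTTP 429, preferring the server's Retry-After."""
    attempt = 0
    while True:
        response = send()
        status, headers, _body = response
        # Return on success, on any non-throttle error, or when retries run out.
        if status != 429 or attempt >= max_retries:
            return response
        delay = parse_retry_after(headers)
        if delay is None:
            delay = backoff_delay(attempt)
        sleep(delay)
        attempt += 1
```

In practice `send` would wrap whatever HTTP client issues the inference request and return its status code and headers. Jitter matters because many clients throttled at the same moment would otherwise retry in lockstep.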
