Amazon Sagemaker Rate Limits

Name: Amazon Sagemaker Rate Limits
Creator: Amazon SageMaker
Keywords: Rate Limiting, Machine Learning, SageMaker

Amazon SageMaker exposes a control-plane API (CreateTrainingJob, CreateEndpoint, etc.) that is subject to standard AWS API throttling per account and region, plus a runtime InvokeEndpoint surface whose throughput scales with the underlying instance count and instance type. Endpoint-specific limits (concurrent invocations, payload size, timeout) depend on how the endpoint is configured. Service Quotas governs the maximum number and type of instances per account.

Amazon Sagemaker Rate Limits is the machine-readable rate-limit profile for Amazon SageMaker on the APIs.io network, conforming to the API Commons Rate Limits specification.

It captures 5 rate-limit definitions, measured in the following units: varies, requests_per_second, bytes, seconds, and count.

The profile also defines 4 backoff/retry policies and documents response codes for the throttled, quotaExceeded, and serviceUnavailable conditions.

Tagged areas include Rate Limiting, Machine Learning, and SageMaker.

5 Limits · Throttle: 400 · Quota: 400

Limits

SageMaker control-plane API
Scope: account/region
Metric: varies
Limit: see Service Quotas console for SageMaker
Notes: Standard AWS API throttling envelope.

InvokeEndpoint (real-time)
Scope: endpoint
Metric: requests_per_second
Limit: scales with instance count and type
Notes: Default soft limit per endpoint; configure auto-scaling on the production variant. Payload up to 6 MB synchronous, 1 GB asynchronous.

InvokeEndpoint payload size
Scope: endpoint
Metric: bytes
Limit: 6291456
Notes: 6 MB max synchronous payload; use AsynchronousInferenceConfig for larger payloads (up to 1 GB).
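
The payload limit above lends itself to a simple client-side check before a request is sent. The following is a minimal sketch, not part of the profile: the function name `choose_invocation_mode` is hypothetical, and reading "1 GB" as 1024^3 bytes is an assumption.

```python
MAX_SYNC_PAYLOAD_BYTES = 6_291_456   # 6 MB synchronous limit listed above
MAX_ASYNC_PAYLOAD_BYTES = 1024 ** 3  # assumption: "1 GB" interpreted as 1 GiB


def choose_invocation_mode(payload_size: int) -> str:
    """Return "sync" or "async" for a payload size in bytes.

    Raises ValueError when the payload fits neither path.
    """
    if payload_size < 0:
        raise ValueError("payload_size must be non-negative")
    if payload_size <= MAX_SYNC_PAYLOAD_BYTES:
        return "sync"
    if payload_size <= MAX_ASYNC_PAYLOAD_BYTES:
        return "async"
    raise ValueError(
        f"payload of {payload_size} bytes exceeds the asynchronous limit"
    )
```

A caller could use the returned mode to decide between a synchronous InvokeEndpoint call and an asynchronous endpoint.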

Synchronous invocation timeout
Scope: endpoint
Metric: seconds
Notes: Default 60 s; can be raised on async endpoints up to 1 hour.

ML instances per type per region
Scope: account/region
Metric: count
Limit: see Service Quotas console for SageMaker
Notes: Soft limits; raise via Service Quotas before training/deploying at scale.

Policies

Backoff with jitter

AWS SDKs default to standard retry mode (truncated exponential backoff with jitter, max 20s, 3 attempts).
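
To make the policy concrete, here is a minimal sketch of truncated exponential backoff with full jitter, using the 20-second cap and 3 attempts quoted above. It illustrates the technique and is not the SDKs' internal implementation; `ThrottledError`, `backoff_delay`, and `call_with_backoff` are hypothetical names, and the 1-second base delay is an assumption.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as ThrottlingException."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 20.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay before the retry that follows failed attempt `attempt` (1-based).

    The exponential term is truncated at `cap` seconds, then scaled by a
    random factor in [0, 1) (full jitter).
    """
    return rng() * min(cap, base * 2 ** (attempt - 1))


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call `fn`, retrying on ThrottledError up to `max_attempts` total attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # attempts exhausted; surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))
    raise RuntimeError("max_attempts must be at least 1")
```

With boto3, the same behavior can be requested explicitly rather than hand-rolled, for example by passing `botocore.config.Config(retries={"mode": "standard", "total_max_attempts": 3})` when creating a client.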

Auto-scaling

Configure target-tracking scaling on production variants (using the SageMakerVariantInvocationsPerInstance metric) to absorb load.
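
As an illustration, target tracking for an endpoint variant is set up through Application Auto Scaling. The sketch below only builds the request parameters and hands them to a client supplied by the caller; the helper names, the policy name, and the example values are hypothetical, and the client is assumed to be a boto3 `application-autoscaling` client.

```python
from typing import Any, Dict


def scaling_requests(endpoint_name: str, variant_name: str,
                     min_instances: int, max_instances: int,
                     target_invocations_per_instance: float) -> Dict[str, Dict[str, Any]]:
    """Build keyword arguments for the two Application Auto Scaling calls."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{variant_name}-invocations-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return {"register_scalable_target": target, "put_scaling_policy": policy}


def apply_scaling(autoscaling_client: Any, requests: Dict[str, Dict[str, Any]]) -> None:
    """Register the variant as a scalable target, then attach the policy."""
    autoscaling_client.register_scalable_target(**requests["register_scalable_target"])
    autoscaling_client.put_scaling_policy(**requests["put_scaling_policy"])
```

A call might look like `apply_scaling(boto3.client("application-autoscaling"), scaling_requests("my-endpoint", "AllTraffic", 1, 4, 100))`, where the endpoint name, variant name, and numbers are placeholders to be tuned per workload.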

Quota increases

ML instance counts, training-job concurrency, and notebook quotas are all soft limits; raise them via Service Quotas before large training or deployment runs.
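
Quota values can also be inspected and raised programmatically. The following is a rough sketch that assumes a boto3 `service-quotas` client is passed in; the helper names and the substring-matching approach are illustrative choices, not something the profile prescribes.

```python
from typing import Any, Dict, List


def find_sagemaker_quotas(quotas_client: Any, name_fragment: str) -> List[Dict[str, Any]]:
    """List SageMaker quotas whose name contains `name_fragment` (case-insensitive)."""
    paginator = quotas_client.get_paginator("list_service_quotas")
    needle = name_fragment.lower()
    matches = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        for quota in page.get("Quotas", []):
            if needle in quota["QuotaName"].lower():
                matches.append({
                    "code": quota["QuotaCode"],
                    "name": quota["QuotaName"],
                    "value": quota["Value"],
                    "adjustable": quota.get("Adjustable", False),
                })
    return matches


def request_increase(quotas_client: Any, quota_code: str, desired_value: float) -> Any:
    """Submit a quota increase request for one SageMaker quota code."""
    return quotas_client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota_code,
        DesiredValue=float(desired_value),
    )
```

For example, `find_sagemaker_quotas(boto3.client("service-quotas"), "endpoint usage")` would return the matching quota entries, and a returned `code` can then be passed to `request_increase`. The search string here is only a placeholder.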

Async inference for large payloads

Use SageMaker Asynchronous Inference for payloads larger than 6 MB or processing longer than 60 s; requests are queued, with input payloads read from and results written to Amazon S3.
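
A minimal sketch of submitting an asynchronous request is shown below. It assumes a boto3 `sagemaker-runtime` client is passed in and that the input payload has already been uploaded to S3; the helper name `invoke_async` and the default content type are illustrative assumptions.

```python
from typing import Any


def invoke_async(runtime_client: Any, endpoint_name: str, input_s3_uri: str,
                 content_type: str = "application/json") -> str:
    """Queue an asynchronous inference request and return the S3 output location.

    The result is not available immediately; the returned URI is where the
    output object is expected to appear once processing finishes.
    """
    if not input_s3_uri.startswith("s3://"):
        raise ValueError("input_s3_uri must be an s3:// URI")
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,
        ContentType=content_type,
    )
    return response["OutputLocation"]
```

A caller would typically poll the returned location, or subscribe to a notification if the endpoint is configured with one, rather than block on the request.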
