Scalable Inference Serving

A collection of APIs, frameworks, and platforms for serving, deploying, and managing machine learning model inference at scale. It covers the KServe Open Inference Protocol (the CNCF standard for model serving on Kubernetes), BentoML (developer packaging and serving), vLLM (high-throughput LLM inference), NVIDIA Triton Inference Server, and supporting observability and registry tools. KServe joined CNCF as an incubating project in November 2025.

6 APIs · 0 Features

AI, CNCF, Deployment, Inference, Kubernetes, LLM, Machine Learning, Model Serving, MLOps, Scalability

APIs

KServe Open Inference Protocol API

KServe implements the Open Inference Protocol (OIP), also known as the KServe V2 Inference Protocol, which provides a standardized REST and gRPC interface for model inference ac...

BentoML REST API

BentoML is an open-source unified inference platform for deploying and scaling AI models. It auto-generates RESTful APIs from Python service definitions, provides built-in OpenA...

vLLM OpenAI-Compatible API

vLLM is a high-throughput and memory-efficient inference engine for LLMs, implementing PagedAttention for efficient KV cache management. vLLM exposes an OpenAI-compatible REST A...

NVIDIA Triton Inference Server HTTP API

NVIDIA Triton Inference Server is an open-source inference serving software that implements the KServe Open Inference Protocol (V2). Supports TensorRT, ONNX, TensorFlow, PyTorch...

MLflow Model Registry REST API

MLflow is an open source platform for managing the ML lifecycle, including experiment tracking, reproducibility, and deployment. The MLflow REST API manages experiments, runs, m...

Ray Serve REST API

Ray Serve is a scalable model serving library built on Ray, designed for building online inference APIs. Supports composable deployments, autoscaling, HTTP ingress, gRPC, WebSoc...

Collections

KServe Open Inference Protocol API

Pricing Plans

Scalable Inference Serving Plans Pricing

1 plan

Rate Limits

Scalable Inference Serving Rate Limits

1 limit

FinOps

Scalable Inference Serving FinOps

Semantic Vocabularies

Scalable Inference Serving Context

12 classes · 11 properties

JSON-LD

API Governance Rules

Scalable Inference Serving API Rules

17 rules · 5 errors · 9 warnings · 3 info

SPECTRAL

JSON Structure

KServe Inference Request Structure

0 properties

Scalable Inference Serving Structure

0 properties

GitHub Organization

Sources

opencollection: 1.0.0
info:
  name: KServe Open Inference Protocol API
  version: v2
items:
- info:
    name: Health
    type: folder
  items:
  - info:
      name: Check Server Liveness
      type: http
    http:
      method: GET
      url: https://inference.kserve.example.com/v2/health/live
    docs: The server liveness API indicates if the inference server is able to receive and respond to metadata and inference
      requests. Can be used directly to implement the Kubernetes livenessProbe.
  - info:
      name: Check Server Readiness
      type: http
    http:
      method: GET
      url: https://inference.kserve.example.com/v2/health/ready
    docs: The server readiness API indicates if all the models are ready for inferencing. Can be used directly to implement
      the Kubernetes readinessProbe.
  - info:
      name: Check Model Readiness
      type: http
    http:
      method: GET
      url: https://inference.kserve.example.com/v2/models/:model_name/ready
      params:
      - name: model_name
        value: bert-sentiment-classifier
        type: path
        description: Name of the model to check readiness for.
    docs: The model readiness API indicates if a specific model is ready for inferencing. Check this before submitting inference
      requests to a newly deployed model.
  - info:
      name: Check Model Version Readiness
      type: http
    http:
      method: GET
      url: https://inference.kserve.example.com/v2/models/:model_name/versions/:model_version/ready
      params:
      - name: model_name
        value: bert-sentiment-classifier
        type: path
      - name: model_version
        value: '2'
        type: path
    docs: Check if a specific version of a model is ready for inference.
- info:
    name: Metadata
    type: folder
  items:
  - info:
      name: Get Server Metadata
      type: http
    http:
      method: GET
      url: https://inference.kserve.example.com/v2
    docs: Returns metadata about the inference server, including its name, version, and the extensions it supports.
  - info:
      name: Get Model Metadata
      type: http
    http:
      method: GET
      url: https://inference.kserve.example.com/v2/models/:model_name
      params:
      - name: model_name
        value: resnet50-image-classifier
        type: path
        description: Name of the model.
    docs: Returns metadata about a model, including its name, versions, platform, inputs, and outputs. Use this to discover
      the input/output tensor shapes and data types before submitting inference requests.
  - info:
      name: Get Model Version Metadata
      type: http
    http:
      method: GET
      url: https://inference.kserve.example.com/v2/models/:model_name/versions/:model_version
      params:
      - name: model_name
        value: resnet50-image-classifier
        type: path
      - name: model_version
        value: '2'
        type: path
    docs: Returns metadata for a specific version of a model.
- info:
    name: Inference
    type: folder
  items:
  - info:
      name: Run Model Inference
      type: http
    http:
      method: POST
      url: https://inference.kserve.example.com/v2/models/:model_name/infer
      params:
      - name: model_name
        value: bert-sentiment-classifier
        type: path
        description: Name of the model to run inference against.
      body:
        type: json
        data: '{}'
    docs: 'Submit an inference request to a model. The request body contains the input tensors as JSON. Inputs are specified
      as an array of named tensors with shape, datatype, and data fields. The response contains the model''s output tensors.

      For large tensor payloads, the binary tensor data extension allows sending tensor data as binary in the request body
      alongside the JSON header.'
  - info:
      name: Run Model Version Inference
      type: http
    http:
      method: POST
      url: https://inference.kserve.example.com/v2/models/:model_name/versions/:model_version/infer
      params:
      - name: model_name
        value: resnet50-image-classifier
        type: path
      - name: model_version
        value: '2'
        type: path
      body:
        type: json
        data: '{}'
    docs: Submit an inference request to a specific version of a model. Useful for A/B testing, canary rollouts, and version-pinned
      integrations.
bundled: true
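
The two inference requests in the collection carry an empty `{}` body, so the following is a minimal Python sketch of what a populated request might look like. It builds the `/v2/models/.../infer` URL (optionally version-pinned) and a JSON body with an `inputs` array of named tensors, each with `shape`, `datatype`, and `data` fields, as the collection's docs describe. The tensor name `input_ids`, its shape, and the token values are illustrative assumptions, not taken from any real model; the helper names (`infer_url`, `make_tensor`, `build_infer_request`) are hypothetical, and the host is the placeholder used throughout the collection.

```python
import json
from urllib import request

# Placeholder host copied from the collection; replace with a real endpoint.
BASE_URL = "https://inference.kserve.example.com"


def infer_url(model_name, model_version=None, base_url=BASE_URL):
    """Build the inference URL, optionally pinned to a model version."""
    path = f"/v2/models/{model_name}"
    if model_version is not None:
        path += f"/versions/{model_version}"
    return f"{base_url}{path}/infer"


def make_tensor(name, shape, datatype, data):
    """Describe one named input tensor; data is a flat list of values."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError(
            f"shape {list(shape)} needs {expected} values, got {len(data)}"
        )
    return {
        "name": name,
        "shape": list(shape),
        "datatype": datatype,
        "data": list(data),
    }


def build_infer_request(model_name, inputs, model_version=None):
    """Prepare (but do not send) a POST request with a JSON tensor body."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return request.Request(
        infer_url(model_name, model_version),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical tensor: name, shape, and token IDs are assumptions.
tensor = make_tensor("input_ids", [1, 4], "INT64", [101, 2023, 2003, 102])
req = build_infer_request("bert-sentiment-classifier", [tensor])

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Nothing is sent over the network here. Against a live server, passing `req` to `urllib.request.urlopen` and reading the `outputs` array from the JSON response would complete the round trip. Calling the model metadata endpoint first is the safer way to find the real tensor names, shapes, and datatypes rather than guessing them as this sketch does.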