Scalable Inference Serving logo

Scalable Inference Serving

A collection of APIs, frameworks, and platforms for scalable machine learning model inference serving, deployment, and management. This includes the KServe Open Inference Protocol (the CNCF standard for model serving on Kubernetes), BentoML (developer packaging and serving), vLLM (high-throughput LLM inference), NVIDIA Triton Inference Server, and supporting observability and registry tools. KServe recently joined CNCF as an incubating project (November 2025).

6 APIs 0 Features
AICNCFDeploymentInferenceKubernetesLLMMachine LearningModel ServingMLOpsScalability

APIs

KServe Open Inference Protocol API

KServe implements the Open Inference Protocol (OIP), also known as the KServe V2 Inference Protocol, which provides a standardized REST and gRPC interface for model inference ac...

BentoML REST API

BentoML is an open-source unified inference platform for deploying and scaling AI models. It auto-generates RESTful APIs from Python service definitions, provides built-in OpenA...

vLLM OpenAI-Compatible API

vLLM is a high-throughput and memory-efficient inference engine for LLMs, implementing PagedAttention for efficient KV cache management. vLLM exposes an OpenAI-compatible REST A...

NVIDIA Triton Inference Server HTTP API

NVIDIA Triton Inference Server is an open-source inference serving software that implements the KServe Open Inference Protocol (V2). Supports TensorRT, ONNX, TensorFlow, PyTorch...

MLflow Model Registry REST API

MLflow is an open source platform for managing the ML lifecycle, including experiment tracking, reproducibility, and deployment. The MLflow REST API manages experiments, runs, m...

Ray Serve REST API

Ray Serve is a scalable model serving library built on Ray, designed for building online inference APIs. Supports composable deployments, autoscaling, HTTP ingress, gRPC, WebSoc...

Semantic Vocabularies

Scalable Inference Serving Context

12 classes · 11 properties

JSON-LD

API Governance Rules

Scalable Inference Serving API Rules

17 rules · 5 errors 9 warnings 3 info

SPECTRAL

Resources

🔑
Authentication
Authentication
🚀
GettingStarted
GettingStarted
👥
GitHubOrganization
GitHubOrganization
🔗
CNCF Landscape
CNCF Landscape
📰
Blog
Blog
🔗
OpenAPI
OpenAPI
🔗
SpectralRuleset
SpectralRuleset
🔗
JSONSchema
JSONSchema
🔗
JSONSchema
JSONSchema
🔗
JSONLD
JSONLD
🔗
Vocabulary
Vocabulary

Sources

Raw ↑
name: Scalable Inference Serving
description: >-
  A collection of APIs, frameworks, and platforms for scalable machine learning model inference serving, deployment, and
  management. This includes the KServe Open Inference Protocol (the CNCF standard for model serving on Kubernetes),
  BentoML (developer packaging and serving), vLLM (high-throughput LLM inference), NVIDIA Triton Inference Server, and
  supporting observability and registry tools. KServe recently joined CNCF as an incubating project (November 2025).
image: https://kserve.github.io/website/images/KServe.png
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/refs/heads/main/apis.yml
created: '2024-01-01'
modified: '2026-05-19'
specificationVersion: '0.18'
tags:
  - AI
  - CNCF
  - Deployment
  - Inference
  - Kubernetes
  - LLM
  - Machine Learning
  - Model Serving
  - MLOps
  - Scalability
apis:
  - name: KServe Open Inference Protocol API
    description: >-
      KServe implements the Open Inference Protocol (OIP), also known as the KServe V2 Inference Protocol, which
      provides a standardized REST and gRPC interface for model inference across frameworks. KServe is a standardized
      distributed generative and predictive AI inference platform for scalable, multi-framework deployment on
      Kubernetes. CNCF incubating project since November 2025. Supports TensorFlow, PyTorch, scikit-learn, XGBoost,
      ONNX, vLLM, and HuggingFace.
    image: https://kserve.github.io/website/images/KServe.png
    humanUrl: https://kserve.github.io/website/
    baseUrl: https://inference.kserve.example.com
    tags:
      - CNCF
      - Inference
      - Kubernetes
      - Model Serving
      - Open Inference Protocol
      - Open Source
    properties:
      - type: Documentation
        url: https://kserve.github.io/website/docs/intro
      - type: OpenAPI
        url: >-
          https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/openapi/kserve-open-inference-protocol-openapi.yml
      - type: GitHub
        url: https://github.com/kserve/kserve
      - type: ChangeLog
        url: https://github.com/kserve/kserve/releases
      - type: GettingStarted
        url: https://kserve.github.io/website/docs/get_started/
      - type: SwaggerUI
        url: https://kserve.github.io/website/latest/reference/swagger-ui/
    contact:
      - type: Slack
        url: https://kubernetes.slack.com/archives/CH6E58LNP
      - type: GitHub Issues
        url: https://github.com/kserve/kserve/issues
  - name: BentoML REST API
    description: >-
      BentoML is an open-source unified inference platform for deploying and scaling AI models. It auto-generates
      RESTful APIs from Python service definitions, provides built-in OpenAPI/Swagger documentation, supports adaptive
      batching, and integrates with KServe for Kubernetes deployment. BentoML 1.0 introduced the Runner abstraction for
      parallelizing inference workloads with adaptive batching and independent scaling of pre/post-processing from model
      inference.
    image: https://www.bentoml.com/favicon.ico
    humanUrl: https://www.bentoml.com/
    baseUrl: https://api.bentoml.example.com
    tags:
      - Batching
      - Inference
      - Model Serving
      - Open Source
      - Python
      - REST API
    properties:
      - type: Documentation
        url: https://docs.bentoml.com/en/latest/
      - type: GitHub
        url: https://github.com/bentoml/BentoML
      - type: GettingStarted
        url: https://docs.bentoml.com/en/latest/get-started/quickstart.html
      - type: Pricing
        url: https://www.bentoml.com/pricing
      - type: APIReference
        url: https://docs.bentoml.com/en/latest/reference/index.html
    contact:
      - type: Community
        url: https://l.bentoml.com/join-slack
      - type: GitHub Issues
        url: https://github.com/bentoml/BentoML/issues
  - name: vLLM OpenAI-Compatible API
    description: >-
      vLLM is a high-throughput and memory-efficient inference engine for LLMs, implementing PagedAttention for
      efficient KV cache management. vLLM exposes an OpenAI-compatible REST API allowing seamless migration from OpenAI
      endpoints. In 2026, vLLM integrates with KServe via LLMInferenceService and llm-d for production-grade distributed
      LLM inference. Powers major LLM deployments at scale.
    image: https://docs.vllm.ai/en/stable/_static/logo/vllm-logo-text-light.png
    humanUrl: https://docs.vllm.ai/
    baseUrl: https://vllm.example.com/v1
    tags:
      - GPU
      - Inference
      - KV Cache
      - LLM
      - Model Serving
      - Open Source
      - OpenAI-Compatible
    properties:
      - type: Documentation
        url: https://docs.vllm.ai/en/stable/
      - type: GitHub
        url: https://github.com/vllm-project/vllm
      - type: APIReference
        url: https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html
      - type: ChangeLog
        url: https://github.com/vllm-project/vllm/releases
    contact:
      - type: GitHub Issues
        url: https://github.com/vllm-project/vllm/issues
      - type: Slack
        url: https://vllm-dev.slack.com/
  - name: NVIDIA Triton Inference Server HTTP API
    description: >-
      NVIDIA Triton Inference Server is an open-source inference serving software that implements the KServe Open
      Inference Protocol (V2). Supports TensorRT, ONNX, TensorFlow, PyTorch, and Python backends. Provides dynamic
      batching, model ensembles, model analyzers, and GPU/CPU inference. Used extensively in production ML pipelines
      requiring maximum throughput.
    image: https://developer.nvidia.com/favicon.ico
    humanUrl: https://developer.nvidia.com/triton-inference-server
    baseUrl: https://triton.example.com
    tags:
      - GPU
      - Inference
      - Model Serving
      - NVIDIA
      - Open Source
      - TensorRT
      - Triton
    properties:
      - type: Documentation
        url: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/
      - type: GitHub
        url: https://github.com/triton-inference-server/server
      - type: GettingStarted
        url: https://github.com/triton-inference-server/tutorials
      - type: APIReference
        url: >-
          https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/customization_guide/inference_protocols.html
    contact:
      - type: GitHub Issues
        url: https://github.com/triton-inference-server/server/issues
      - type: Forums
        url: https://forums.developer.nvidia.com/c/ai-data-science/deep-learning/triton-inference-server/
  - name: MLflow Model Registry REST API
    description: >-
      MLflow is an open source platform for managing the ML lifecycle, including experiment tracking, reproducibility,
      and deployment. The MLflow REST API manages experiments, runs, metrics, parameters, artifacts, and the Model
      Registry for versioning and staging model deployments. CNCF-adjacent; used with KServe for model lifecycle
      management.
    image: https://mlflow.org/favicon.ico
    humanUrl: https://mlflow.org/
    baseUrl: https://mlflow.example.com/api/2.0
    tags:
      - Experiment Tracking
      - Machine Learning
      - Model Registry
      - MLOps
      - Open Source
      - Versioning
    properties:
      - type: Documentation
        url: https://mlflow.org/docs/latest/rest-api.html
      - type: GitHub
        url: https://github.com/mlflow/mlflow
      - type: GettingStarted
        url: https://mlflow.org/docs/latest/getting-started/intro-quickstart/
      - type: APIReference
        url: https://mlflow.org/docs/latest/rest-api.html
    contact:
      - type: Community
        url: https://github.com/mlflow/mlflow/discussions
      - type: GitHub Issues
        url: https://github.com/mlflow/mlflow/issues
  - name: Ray Serve REST API
    description: >-
      Ray Serve is a scalable model serving library built on Ray, designed for building online inference APIs. Supports
      composable deployments, autoscaling, HTTP ingress, gRPC, WebSockets, and request batching. Integrates with any ML
      framework. The Ray Serve dashboard and REST API manage deployments, replicas, routes, and application status.
    image: https://www.ray.io/favicon.ico
    humanUrl: https://docs.ray.io/en/latest/serve/index.html
    baseUrl: https://ray-serve.example.com
    tags:
      - Autoscaling
      - Inference
      - Machine Learning
      - Model Serving
      - Open Source
      - Python
      - Ray
    properties:
      - type: Documentation
        url: https://docs.ray.io/en/latest/serve/index.html
      - type: GitHub
        url: https://github.com/ray-project/ray
      - type: GettingStarted
        url: https://docs.ray.io/en/latest/serve/getting_started.html
      - type: APIReference
        url: https://docs.ray.io/en/latest/serve/api/index.html
    contact:
      - type: Community
        url: https://discuss.ray.io/
      - type: GitHub Issues
        url: https://github.com/ray-project/ray/issues
common:
  - type: Authentication
    url: https://kserve.github.io/website/docs/intro
  - type: GettingStarted
    url: https://kserve.github.io/website/docs/get_started/
  - type: GitHubOrganization
    url: https://github.com/kserve
  - type: CNCF Landscape
    url: https://landscape.cncf.io/card-mode?project=incubating
  - type: Blog
    url: https://kserve.github.io/website/blog/
  - type: OpenAPI
    url: >-
      https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/openapi/kserve-open-inference-protocol-openapi.yml
  - type: SpectralRuleset
    url: >-
      https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/rules/kserve-open-inference-protocol-rules.yml
  - type: JSONSchema
    url: >-
      https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/json-schema/kserve-inference-request-schema.json
  - type: JSONSchema
    url: >-
      https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/json-schema/kserve-model-metadata-schema.json
  - type: JSONLD
    url: >-
      https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/json-ld/scalable-inference-serving-context.jsonld
  - type: Vocabulary
    url: >-
      https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/vocabulary/scalable-inference-serving-vocabulary.yml
maintainers:
  - name: API Evangelist
    email: kin@apievangelist.com
    url: https://apievangelist.com