Triton Inference Server

NVIDIA Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton exposes HTTP/REST and gRPC endpoints that let remote clients request inference from any model the server manages. Triton is open source and part of the broader NVIDIA AI ecosystem. It implements the KServe V2 inference protocol and supports backends including TensorRT, TensorFlow, PyTorch, ONNX Runtime, and Python.

3 APIs 0 Features

AI, Deep Learning, Inference, Machine Learning, Model Serving, NVIDIA, Open Source

APIs

Triton HTTP/REST API

RESTful API implementing the KServe V2 inference protocol for model inference, health checks, metadata queries, model repository management, statistics, tracing, and logging.
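
As a rough illustration of the inference and health-check operations listed above, the following Python sketch builds a KServe V2 inference request and probes server readiness using only the standard library. It assumes Triton's HTTP endpoint is reachable on port 8000 of localhost; the model name `my_model` and tensor name `INPUT0` used in the usage note are hypothetical placeholders, not names taken from this listing.

```python
import json
import urllib.request

# Assumption: Triton's HTTP endpoint is listening on localhost, port 8000.
BASE_URL = "http://localhost:8000"


def build_infer_request(model_name, input_name, shape, datatype, data):
    """Build (but do not send) a KServe V2 inference request."""
    url = f"{BASE_URL}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,    # tensor name defined by the model
                "shape": shape,        # e.g. [1, 4]
                "datatype": datatype,  # e.g. "FP32"
                "data": data,          # flattened tensor values
            }
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def server_ready(timeout=2.0):
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{BASE_URL}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or an HTTP error status.
        return False
```

A caller might check `server_ready()` first, then pass `build_infer_request("my_model", "INPUT0", [1, 4], "FP32", [1.0, 2.0, 3.0, 4.0])` to `urllib.request.urlopen` and read the `outputs` field of the JSON response.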

Triton gRPC API

High-performance gRPC API for model inference with support for streaming and binary tensor data.
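
To make "binary tensor data" concrete, here is a minimal sketch of how a flattened tensor can be serialized to raw bytes before it is placed in a request. It assumes the V2 protocol's raw representation (elements flattened in row-major order, little-endian byte order) and covers only a handful of datatypes; the helper names `pack_tensor` and `unpack_tensor` are hypothetical and not part of any Triton client library.

```python
import struct

# Assumption: a small subset of V2 datatypes, mapped to struct format codes.
_STRUCT_CODES = {"FP32": "f", "FP64": "d", "INT32": "i", "INT64": "q"}


def pack_tensor(values, datatype="FP32"):
    """Serialize a flattened tensor to little-endian raw bytes."""
    code = _STRUCT_CODES[datatype]
    return struct.pack(f"<{len(values)}{code}", *values)


def unpack_tensor(raw, datatype="FP32"):
    """Inverse of pack_tensor: decode raw bytes into a list of values."""
    code = _STRUCT_CODES[datatype]
    count = len(raw) // struct.calcsize(f"<{code}")
    return list(struct.unpack(f"<{count}{code}", raw))
```

Sending raw bytes avoids the cost of encoding every element as text, which is the usual motivation for a binary tensor path.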

Triton Metrics API

Prometheus-compatible metrics API for monitoring server and model performance including inference request counts, latencies, GPU utilization, and memory usage.
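
The sketch below shows one way to read such metrics: a small parser for the Prometheus text exposition format, plus a helper that sums a metric across its label sets. The `SAMPLE` text is a hypothetical excerpt written for illustration (its help text, label names, and values are assumptions); only the metric names `nv_inference_request_success` and `nv_inference_request_failure` come from the collection documented in this listing. The parser handles the common `name{labels} value` form and is not a full implementation of the format.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, metric_name):
    """Sum a metric over all of its label combinations."""
    return sum(value for name, _, value in samples if name == metric_name)


# Hypothetical excerpt, for illustration only.
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="my_model",version="1"} 42
nv_inference_request_failure{model="my_model",version="1"} 3
"""
```

Against a live server, the text could be fetched from the metrics URL given in the collection (`http://localhost:8002/metrics`) with `urllib.request.urlopen`, decoded as UTF-8, and passed to `parse_metrics`.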

Collections

Triton Inference Server NVIDIA Triton Inference Server HTTP/REST API

Triton Inference Server NVIDIA Triton Inference Server Metrics API

Pricing Plans

Triton Plans Pricing

2 plans

Rate Limits

Triton Rate Limits

2 limits

FinOps

Triton FinOps

Semantic Vocabularies

Triton Context

0 classes · 9 properties

JSON-LD

API Governance Rules

Triton Inference Server API Rules

8 rules · 1 error, 5 warnings, 2 info

SPECTRAL

JSON Structure

Triton Model Structure

0 properties

Example Payloads

Triton Model Infer Example

4 fields

Triton Repository Index Example

4 fields

Resources

GitHub Repository

Documentation

Getting Started

Client Libraries

Model Repository

Supported Backends

Docker Images

Community Forum

Release Notes

Model Analyzer

JSON Structure

Spectral Rules

Sources

opencollection: 1.0.0
info:
  name: Triton Inference Server NVIDIA Triton Inference Server Metrics API
  version: '2.0'
items:
- info:
    name: Metrics
    type: folder
  items:
  - info:
      name: Triton Inference Server Get Prometheus metrics
      type: http
    http:
      method: GET
      url: http://localhost:8002/metrics
    docs: 'Retrieve all available metrics in Prometheus text exposition format. Includes server-level metrics (request counts,
      latencies, GPU utilization, memory usage) and per-model metrics (inference counts, queue times, compute times). Metrics
      are labeled with model name, version, GPU UUID, and other dimensions.


      Key metric families include:

      - `nv_inference_request_success` - Successful inference request count

      - `nv_inference_request_failure` - Failed inference request count

      - `nv_inference_count` - Tot'
bundled: true