Agent Skill · NVIDIA NIM

tao-generate-referring-expressions

Four-step image referring-expression pipeline: turns images plus KITTI bounding-box labels into region descriptions, scene captions, grounded referring expressions, and (optionally) verified expressions via VLM distillation. Use when the user wants to generate referring-expression annotations from images with KITTI labels, build region descriptions, produce grouped grounding phrases tied to bboxes, run a double-check verification pass on grounding expressions, auto-label traffic / scene images for referring datasets, or run the image_referring_expression pipeline. Triggers include 'referring expression', 'region description', 'KITTI labels', 'spatial relationship annotation', 'auto-label image referring expression', 'image_referring_expression'.

Provider: NVIDIA NIM
Path in repo: skills/tao-generate-referring-expressions/SKILL.md

Skill body

Image Referring Expression Pipeline

Generate referring-expression and grounding annotations from images with KITTI-format bounding box labels. A single VLM (Gemini or any OpenAI-compatible endpoint) runs four steps: per-object region descriptions, holistic image captions, grouped grounding expressions tied to bboxes, and an optional double-check verification pass.

Purpose

Transform (image, KITTI labels) pairs into a unified annotations.jsonl containing rich, grounded referring expressions. The VLM acts as a “teacher” annotator: Steps 0-1 see the image; Step 2 groups Step 0 outputs into grouping phrases with bbox lists; Step 3 (optional) re-examines those bboxes against the image and corrects mismatches.

Pipeline Architecture

Step 0: Region expression  ──┐
                              ├──▶  Step 2: Grounding expression  ──▶  [Step 3: Double check]
Step 1: Image caption  ──────┘                                                   (optional)

Steps 0 and 1 run in parallel within a single thread pool (they only depend on the seed records). Each step writes its own step_<N>_*/annotations.jsonl and skips already-processed images on re-run unless workflow.force_reprocess: true.
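
The scheduling and resume behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual orchestrator: run_step, run_pipeline, and the annotate callables are hypothetical names, the step_1_* directory name is whatever the caller passes in, every output record is assumed to carry an image_id key, and the hand-off of Step 0 and Step 1 outputs into Step 2 is left out for brevity.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def processed_ids(step_file: Path) -> set[str]:
    """Return the image_ids already written to a step's annotations.jsonl."""
    if not step_file.exists():
        return set()
    with step_file.open() as fh:
        return {json.loads(line)["image_id"] for line in fh if line.strip()}


def run_step(name, annotate, records, results_dir, force_reprocess=False):
    """Run one step and return how many records were actually processed."""
    out = Path(results_dir) / name / "annotations.jsonl"
    out.parent.mkdir(parents=True, exist_ok=True)
    if force_reprocess and out.exists():
        out.unlink()  # drop cached output and start from scratch
    done = processed_ids(out)
    todo = [rec for rec in records if rec["image_id"] not in done]
    with out.open("a") as fh:
        for rec in todo:
            fh.write(json.dumps(annotate(rec)) + "\n")
    return len(todo)


def run_pipeline(records, steps, results_dir, force_reprocess=False, max_workers=4):
    """steps maps a step directory name to a hypothetical annotate(record) callable."""
    counts = {}
    # Steps 0 and 1 depend only on the seed records, so they can run side by side.
    parallel = [n for n in steps if n.startswith(("step_0_", "step_1_"))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            n: pool.submit(run_step, n, steps[n], records, results_dir, force_reprocess)
            for n in parallel
        }
        counts.update({n: f.result() for n, f in futures.items()})
    # Remaining steps (2, then optionally 3) run one after another.
    for n in sorted(n for n in steps if n not in parallel):
        counts[n] = run_step(n, steps[n], records, results_dir, force_reprocess)
    return counts
```

Calling run_pipeline twice with the same arguments processes nothing the second time, which mirrors the skip-on-re-run behavior; passing force_reprocess=True clears each step's file first.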

Instructions

Initial setup

When a user wants to run this pipeline, walk through these steps:

  1. Images: Ask for data.image_dir, the directory containing .jpg, .jpeg, or .png images.
  2. KITTI labels: Ask for data.kitti_label_dir, the directory containing one .txt label file per image. Each label line must use KITTI format: <type> <truncated> <occluded> <alpha> <bbox_left> <bbox_top> <bbox_right> <bbox_bottom> .... Lines with fewer than 8 fields are silently skipped. Set this even for Step 1-only runs because Steps 0 and 2 require it.
  3. Resume from existing annotations: If the user already has a unified annotations.jsonl from a previous run, set data.input_annotations_jsonl to that file instead of seeding from data.image_dir and data.kitti_label_dir.
  4. API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
    1. Gemini — set vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
    2. NIM (e.g. https://inference-api.nvidia.com/v1) — set vlm.backend: "openai"; collect base_url, model_name, and api_key.
    3. TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      • Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
      • Not running — guide the user through the skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    4. vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      • Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
      • Not running — follow references/vllm_server.md to install and launch a vLLM server, then collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    5. Custom (any other OpenAI-compatible endpoint) — set vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.

    If the user has no endpoint and does not want to set one up, stop and help resolve API access first.

  5. Workflow steps: Choose one of:
    • Full pipeline: ["0", "1", "2", "3"]
    • No caption generation: ["0", "2", "3"], where Step 2 falls back to image-only context
    • No verification: ["0", "1", "2"]
    • Custom subset: any supported subset of steps
  6. Output format: Choose one of:
    • jsonl: unified schema only
    • legacy: byte-compatible .txt.stepN files only
    • both: writes both formats and is the default for downstream tooling

Running the pipeline

The pipeline runs inside the TAO Toolkit container via the auto_label CLI:

auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_referring_expression.data.image_dir=/data/images \
    image_referring_expression.data.kitti_label_dir=/data/labels \
    image_referring_expression.vlm.gemini.api_key=$GOOGLE_API_KEY

Generate a default spec: auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_referring_expression". All fields support Hydra dot-notation overrides on the command line.
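
When the command is assembled from a script, the dot-notation overrides can be generated from a nested dictionary. The sketch below is only an illustration: flatten and build_command are hypothetical helper names, and only scalar values are handled (list-valued fields such as workflow.steps need Hydra's own quoting rules, which are not covered here).

```python
import shlex


def flatten(prefix, node):
    """Yield key=value overrides from a nested dict using dot notation."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(path, value)
        elif isinstance(value, bool):
            yield f"{path}={str(value).lower()}"  # write true/false in lowercase
        else:
            yield f"{path}={value}"


def build_command(spec_path, results_dir, overrides):
    """Return the auto_label argument list for the given overrides."""
    args = ["auto_label", "generate", "-e", spec_path, f"results_dir={results_dir}"]
    args += list(flatten("image_referring_expression", overrides))
    return args


cmd = build_command(
    "/path/to/spec.yaml",
    "/results",
    {
        "data": {"image_dir": "/data/images", "kitti_label_dir": "/data/labels"},
        "workflow": {"force_reprocess": True},
    },
)
print(shlex.join(cmd))  # shell-quoted command line, ready to paste or log
```

Keeping the command as a list until the last moment avoids quoting mistakes if it is later passed to subprocess.run.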

See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.

Before scaling up, validate the output on a small sample:

  1. Run on 5-10 images with all four steps.
  2. Inspect step_0_region_expr/annotations.jsonl — are object types, colors, and discriminating phrases accurate?
  3. Inspect step_2_grounding_expr/annotations.jsonl — are objects grouped sensibly, and do bbox coordinates match the described groups?
  4. Inspect step_3_double_check/annotations.jsonl — were mismatched bboxes removed or tightened? Are any new errors introduced (rare)?
  5. If quality is insufficient, switch the VLM to a stronger model (e.g. gemini-2.5-pro or a larger Qwen3-VL endpoint), raise media_resolution / max_output_tokens, then re-run with workflow.force_reprocess=true.
  6. Scale to the full dataset once satisfied.
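
To focus the manual review of the double-check step, it can help to list only the images whose records differ between Step 2 and Step 3. The following is a rough sketch, assuming every line in both files is a JSON object with an image_id key; load_by_image and changed_images are hypothetical names, and the optional field argument should be set to whichever key holds the grounding output in your files, since the record schema is not spelled out here.

```python
import json
from pathlib import Path


def load_by_image(path):
    """Map image_id -> record for one annotations.jsonl file."""
    records = {}
    with Path(path).open() as fh:
        for line in fh:
            if line.strip():
                rec = json.loads(line)
                records[rec["image_id"]] = rec
    return records


def changed_images(results_dir, field=None):
    """Return sorted image_ids whose Step 3 record differs from Step 2.

    With field=None whole records are compared; otherwise only that key.
    """
    root = Path(results_dir)
    before = load_by_image(root / "step_2_grounding_expr" / "annotations.jsonl")
    after = load_by_image(root / "step_3_double_check" / "annotations.jsonl")

    def view(rec):
        return rec if field is None else rec.get(field)

    shared = before.keys() & after.keys()
    return sorted(i for i in shared if view(before[i]) != view(after[i]))
```

If Step 3 adds bookkeeping keys to every record, the whole-record comparison will flag every image, so passing field is usually the more useful mode.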

Configuration

Key configuration fields (full reference in references/configuration.md):

Field                         Default                                      Description
workflow.steps                ["0","1","2","3"]                            Which steps to execute (0=region_expr, 1=image_caption, 2=grounding_expr, 3=double_check)
workflow.max_workers          4                                            Parallel threads per step (watch API rate limits)
workflow.force_reprocess      false                                        Ignore cached per-step outputs and reprocess from scratch
workflow.output_format        "jsonl" (set to "both" in the default spec)  "jsonl", "legacy", or "both"
vlm.backend                   "gemini"                                     "gemini" or "openai" (OpenAI-compatible endpoint)
data.image_dir                required                                     Directory of input images (.jpg / .jpeg / .png)
data.kitti_label_dir          required (unless resuming)                   Directory of KITTI-format .txt label files
data.input_annotations_jsonl  ""                                           Optional pre-seeded annotations.jsonl (skips KITTI seeding)
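
A quick pre-flight check of these fields can catch typos before any API calls are made. The sketch below is an assumption-laden illustration rather than the tool's own validation: check_config is a hypothetical helper, the config is taken to be a plain nested dict mirroring the field names above, and the allowed values are exactly those listed in the table.

```python
VALID_STEPS = {"0", "1", "2", "3"}
VALID_FORMATS = {"jsonl", "legacy", "both"}
VALID_BACKENDS = {"gemini", "openai"}


def check_config(cfg):
    """Return a list of problems found in a nested config dict (empty if none)."""
    problems = []
    workflow = cfg.get("workflow", {})
    vlm = cfg.get("vlm", {})
    data = cfg.get("data", {})

    # Steps are strings ("0".."3"), so bare integers are reported as unknown.
    steps = workflow.get("steps", sorted(VALID_STEPS))
    unknown = [s for s in steps if s not in VALID_STEPS]
    if unknown:
        problems.append(f"workflow.steps has unknown entries: {unknown}")

    fmt = workflow.get("output_format", "jsonl")
    if fmt not in VALID_FORMATS:
        problems.append(f"workflow.output_format must be one of {sorted(VALID_FORMATS)}")

    backend = vlm.get("backend", "gemini")
    if backend not in VALID_BACKENDS:
        problems.append(f"vlm.backend must be one of {sorted(VALID_BACKENDS)}")

    if not data.get("image_dir"):
        problems.append("data.image_dir is required")
    if not data.get("kitti_label_dir") and not data.get("input_annotations_jsonl"):
        problems.append(
            "data.kitti_label_dir is required unless data.input_annotations_jsonl is set"
        )
    return problems
```

Returning a list instead of raising on the first error lets the caller show every problem in one pass.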

Inputs

Two ways to seed the pipeline:

  1. Image directory + KITTI labels (default). Set data.image_dir and data.kitti_label_dir. The orchestrator walks the image directory, reads the matching <stem>.txt KITTI file, parses bboxes (fields 0 + 4-7), reads each image’s width/height via PIL, and writes a seed_annotations.jsonl to results_dir/.
  2. Pre-seeded annotations JSONL (resume / pre-computed regions). Set data.input_annotations_jsonl to a file with one {"image_id", "image_path", "width", "height", "kitti_bboxes": [...]} object per line.
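
The KITTI seeding path can be approximated as follows. This is a sketch under assumptions, not the orchestrator's code: parse_kitti_line and seed_record are hypothetical names, the per-box shape {"type", "bbox"} inside kitti_bboxes is a guess (the element schema is not shown above), image_id is assumed to be the file stem, and width/height are passed in by the caller rather than read with PIL so the example stays dependency-free.

```python
from pathlib import Path


def parse_kitti_line(line):
    """Return (type, [left, top, right, bottom]) or None for short lines."""
    fields = line.split()
    if len(fields) < 8:
        return None  # lines with fewer than 8 fields are skipped
    left, top, right, bottom = (float(v) for v in fields[4:8])
    return fields[0], [left, top, right, bottom]


def seed_record(image_path, label_path, width, height):
    """Build one seed record from an image path and its KITTI label file."""
    bboxes = []
    for line in Path(label_path).read_text().splitlines():
        parsed = parse_kitti_line(line)
        if parsed is not None:
            obj_type, box = parsed
            # Assumed element shape; adjust to the real kitti_bboxes schema.
            bboxes.append({"type": obj_type, "bbox": box})
    return {
        "image_id": Path(image_path).stem,  # assumption: id is the file stem
        "image_path": str(image_path),
        "width": width,
        "height": height,
        "kitti_bboxes": bboxes,
    }
```

Only field 0 (the object type) and fields 4-7 (the box corners) are used; a line with non-numeric box fields raises ValueError here, whereas the real pipeline may handle that case differently.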

Outputs

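All outputs go to results_dir/, including seed_annotations.jsonl (when seeding from KITTI labels) and one step_<N>_*/annotations.jsonl per executed step.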

Prerequisites

Skill frontmatter

license: Apache-2.0
compatibility: Requires docker + nvidia-container-toolkit + at least one VLM endpoint (Gemini API key or OpenAI-compatible).
metadata: {"author" => "NVIDIA Corporation", "version" => "0.1.0"}
tags: image, referring-expression, kitti, bounding-boxes, auto-label, vlm
allowed-tools: Read Bash Write