Agent Skill · NVIDIA NIM

tao-generate-image-grounding

Two-step image grounding pipeline: extracts referring expressions from (image, caption) pairs and grounds them to pixel-space bounding boxes via a VLM. Use when the user wants to ground captions to bboxes, generate phrase-grounded annotations, auto-label images for grounding, or run the image_grounding pipeline. Triggers include 'image grounding', 'phrase grounding', 'ground captions', 'auto-label image grounding', 'image_grounding'.

View SKILL.md on GitHub → Source repository Provider profile

Provider: NVIDIA NIM Path in repo: skills/tao-generate-image-grounding/SKILL.md

Skill body

Image Grounding Pipeline

Turn (image, caption) pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.

Purpose

Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a “teacher” annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 returns one bbox set per expression for each image.

Pipeline Architecture

Step 0: Expression extraction  → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding       → VLM returns pixel bboxes + scores per expression

Steps are individually selectable via workflow.steps. Each step writes a per-sample checkpoint to step_<N>_*/.ckpt/<sample_id>.json and skips already-processed records on re-run. Set workflow.force_reprocess: true to ignore checkpoints and reprocess from scratch.

Instructions

Initial setup

When a user wants to run this pipeline, walk through these steps:

Input JSONL: Ask for the JSONL path. Each line must be one object like {"image_path": "...", "caption": "..."}. image_path can be absolute or relative.
Image root: If any image_path values are relative, set data.image_root to the directory they should resolve from.
API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
1. Gemini — set vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
2. NIM (e.g. https://inference-api.nvidia.com/v1) — set vlm.backend: "openai"; collect base_url, model_name, and api_key.
3. TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Not running — guide the user through the skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
4. vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Not running — follow references/vllm_server.md to install and launch a vLLM server, then collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
5. Custom (any other OpenAI-compatible endpoint) — set vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.
If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
Workflow steps: Choose one of:
- Full pipeline: ["0", "1"]
- Expression extraction only: ["0"]
- Grounding only: ["1"], which requires existing step-0 output at results_dir/step_0_expression_extraction/annotations.jsonl
Resume vs fresh run: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set image_grounding.workflow.force_reprocess=true.

Running the pipeline

The pipeline runs inside the TAO Toolkit container via the auto_label CLI:

auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_grounding.data.input_jsonl=/data/captions.jsonl \
    image_grounding.data.image_root=/data/images \
    image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY

Generate a default spec: auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_grounding". All fields support Hydra dot-notation overrides on the command line.

See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.

Recommended pilot workflow

Run on 5-10 images with both steps
Inspect step_0_expression_extraction/annotations.jsonl — are cleaned_caption and expressions[] accurate? Are the right noun phrases captured?
Inspect step_1_grounding/annotations.jsonl — do the bboxes in expressions[].instances[] look right? Are confidence scores reasonable?
If quality is insufficient, switch the VLM to a stronger model (e.g. gemini-2.5-pro) or raise media_resolution/max_output_tokens, then re-run with force_reprocess=true.
Scale to the full dataset once satisfied.

Configuration

Key configuration fields (full reference in references/configuration.md):

Field	Default	Description
`workflow.steps`	`["0","1"]`	Which pipeline steps to execute (`"0"` = expressions, `"1"` = grounding)
`workflow.max_workers`	`4`	Parallel threads per step (watch API rate limits)
`workflow.force_reprocess`	`false`	Ignore per-sample checkpoints and reprocess from scratch
`vlm.backend`	`"gemini"`	`"gemini"` or `"openai"` (OpenAI-compatible endpoint)
`data.input_jsonl`	required	Path to input JSONL with `image_path` + `caption` per line
`data.image_root`	`""`	Optional prefix for resolving relative `image_path` entries

Inputs

A single JSONL file at data.input_jsonl. One JSON object per line:

Field	Required	Description
`image_path`	yes	Absolute path, or relative path resolved against `data.image_root`
`caption`	yes	Free-text caption for the image
`image_id`	no	Stable identifier; auto-derived from the filename if missing
`width`, `height`	no	Image dimensions in pixels; default to `1920×1080` for bbox clamping if missing

Outputs

All outputs go to results_dir/:

step_0_expression_extraction/annotations.jsonl — per-record output enriched with cleaned_caption and expressions[] (each with text, expression_id, char_span, noun_chunk, empty instances[]).
step_1_grounding/annotations.jsonl — same records with expressions[].instances[] filled in (each instance has bbox: [x1,y1,x2,y2] in pixel space, score in [0.0, 1.0], and bbox_id).
results_dir/annotations.jsonl — copy of the last step’s output for convenience.
step_<N>_*/.ckpt/<sample_id>.json — per-sample checkpoints used for resume.

Prerequisites

Container: nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)

Skill frontmatter

license: Apache-2.0 compatibility: Requires docker + nvidia-container-toolkit + at least one VLM endpoint (Gemini API key or OpenAI-compatible). metadata: {"author" => "NVIDIA Corporation", "version" => "0.1.0"} allowed-tools: Read Bash Write tags: imagegroundingbounding-boxesauto-labelvlm2d-grounding