Agent Skill · NVIDIA NIM

tao-train-sparse4d

Sparse4D for multi-camera temporal 3D object detection and tracking. Uses sparse queries with deformable attention across camera views and time for end-to-end 3D perception, with an instance bank for temporal tracking. Use when training, evaluating, exporting, quantizing, or running inference for a TAO Sparse4D model. Trigger phrases include "train Sparse4D", "multi-camera 3D detection", "temporal 3D tracker", "sparse query 3D perception".

Provider: NVIDIA NIM
Path in repo: skills/tao-train-sparse4d/SKILL.md

Skill body

Sparse4D

Sparse4D for multi-camera temporal 3D object detection and tracking. Uses sparse queries with deformable attention across camera views and time for end-to-end 3D perception. Includes instance bank for temporal tracking.

When a pretrained ResNet-101 backbone is available, use it by setting train.pretrained_model_path. For local smoke validation, Sparse4D training can run with an empty train.pretrained_model_path, but production runs should still use a compatible pretrained model (PTM).

Dataclass Schemas

Generated TAO Core schemas are packaged in schemas/<action>.schema.json, with schemas/manifest.json listing available actions. Each generated schema also emits references/spec_template_<action>.yaml from the schema top-level default field. AutoML enablement is declared at the model layer in references/skill_info.yaml via automl_enabled. Runnable AutoML still requires schemas/train.schema.json and references/spec_template_train.yaml to exist and parse. Use the packaged train schema for automl_default_parameters, automl_disabled_parameters, defaults, min/max bounds, enums, option weights, math conditions, dependencies, and popular parameters. Do not expect ~/tao-core at runtime; maintainers regenerate schemas/templates before packaging the skill bank.
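
As a rough illustration of the "exist and parse" requirement above, the following sketch checks a skill directory for the packaged train schema and spec template. The function name `automl_runnable` and the `skill_dir` argument are hypothetical, not part of the skill bank. It only confirms that the JSON parses and that the template file is non-empty, because parsing the YAML would need a third-party parser.

```python
import json
from pathlib import Path


def automl_runnable(skill_dir: str) -> bool:
    """Return True when the packaged train schema and spec template look usable.

    Hypothetical helper: it checks only the two files named in this section.
    """
    root = Path(skill_dir)
    schema_path = root / "schemas" / "train.schema.json"
    template_path = root / "references" / "spec_template_train.yaml"

    if not schema_path.is_file() or not template_path.is_file():
        return False
    try:
        schema = json.loads(schema_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False
    # A non-empty template is a weak check; real validation would parse the YAML.
    return isinstance(schema, dict) and template_path.stat().st_size > 0
```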

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user’s workflow request. Use automl_policy: on by default and only expose on / off in new launch prompts. Treat phrases like “turn off AutoML”, “disable AutoML”, “no HPO”, or “plain training” as automl_policy: off for this run only. When automl_policy: on, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model’s skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy. Use direct model training only when automl_policy: off or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.
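
The train-stage routing rules above can be sketched as two small functions. This is a minimal illustration, not the skill bank's implementation: the names `resolve_automl_policy`, `route_train`, and `TrainRoute` are hypothetical, phrase matching is a plain substring check, and falling back to direct training when `automl_enabled` is false is an assumption the text does not spell out.

```python
from dataclasses import dataclass
from typing import Optional

# Phrases the policy treats as a per-run "off" override (lower-cased).
OFF_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


@dataclass
class TrainRoute:
    target: str      # skill to route to, or "direct" for direct model training
    note: str = ""   # message to report to the user, if any


def resolve_automl_policy(explicit_policy: Optional[str], user_request: str) -> str:
    """Resolve the per-run policy; an explicit value wins, the default is "on"."""
    if explicit_policy in ("on", "off"):
        return explicit_policy
    text = user_request.lower()
    return "off" if any(phrase in text for phrase in OFF_PHRASES) else "on"


def route_train(policy: str, automl_enabled: bool, schema_packaged: bool) -> TrainRoute:
    """Pick the train route from the policy, model metadata, and packaged files."""
    if policy == "off" or not automl_enabled:
        return TrainRoute("direct")
    if not schema_packaged:
        return TrainRoute(
            "direct",
            "AutoML is enabled but not runnable for this model until schemas are generated.",
        )
    return TrainRoute("tao-skill-bank:tao-run-automl")
```

The policy is resolved per run and never written back, which matches the rule that the override does not change model metadata.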

Training Requirements

Per-Action Dataset Requirements

| Action | Spec Key | Source | Files | List? |
|---|---|---|---|---|
| dataset_convert | aicity.root | id | | No |
| evaluate | dataset.data_root | eval_dataset | (from convert job, spec: aicity.split) | No |
| evaluate | model.head.instance_bank.anchor | train_datasets | /results/{dataset_convert_job_id}/anchor_init.npy | No |
| evaluate | dataset.train_dataset.ann_file | train_datasets | (from convert job, spec: aicity.split) | No |
| evaluate | dataset.val_dataset.ann_file | eval_dataset | (from convert job, spec: aicity.split) | No |
| evaluate | dataset.test_dataset.ann_file | inference_dataset | (from convert job, spec: aicity.split) | No |
| export | model.head.instance_bank.anchor | train_datasets | /results/{dataset_convert_job_id}/anchor_init.npy | No |
| inference | dataset.data_root | inference_dataset | (from convert job, spec: aicity.split) | No |
| inference | model.head.instance_bank.anchor | train_datasets | /results/{dataset_convert_job_id}/anchor_init.npy | No |
| inference | dataset.train_dataset.ann_file | train_datasets | (from convert job, spec: aicity.split) | No |
| inference | dataset.val_dataset.ann_file | eval_dataset | (from convert job, spec: aicity.split) | No |
| inference | dataset.test_dataset.ann_file | inference_dataset | (from convert job, spec: aicity.split) | No |
| quantize | dataset.data_root | train_datasets | (from convert job, spec: aicity.split) | No |
| quantize | model.head.instance_bank.anchor | train_datasets | /results/{dataset_convert_job_id}/anchor_init.npy | No |
| quantize | dataset.train_dataset.ann_file | train_datasets | (from convert job, spec: aicity.split) | No |
| quantize | dataset.val_dataset.ann_file | eval_dataset | (from convert job, spec: aicity.split) | No |
| quantize | dataset.test_dataset.ann_file | inference_dataset | (from convert job, spec: aicity.split) | No |
| quantize | dataset.quant_calibration_dataset.images_dir | train_datasets | | No |
| train | dataset.data_root | train_datasets | (from convert job, spec: aicity.split) | No |
| train | model.head.instance_bank.anchor | train_datasets | /results/{dataset_convert_job_id}/anchor_init.npy | No |
| train | dataset.train_dataset.ann_file | train_datasets | (from convert job, spec: aicity.split) | No |
| train | dataset.val_dataset.ann_file | eval_dataset | (from convert job, spec: aicity.split) | No |
| train | dataset.test_dataset.ann_file | inference_dataset | (from convert job, spec: aicity.split) | No |

Typical Spec Overrides

Data source overrides are mandatory for every action — the agent MUST construct data source paths from the Per-Action Dataset Requirements table above and include them in spec_overrides.

S3_TRAIN = "s3://bucket/data/train"
S3_EVAL = "s3://bucket/data/eval"
CONVERTED_SCENE = "<scene-from-converter>"  # e.g. "subsetscene+bev-sensor-random-0"

train (mandatory data sources):

CONVERTED = "s3://bucket/results/<dataset_convert_job_id>"
{
    "train.num_epochs": 30,
    "train.checkpoint_interval": 10,
    "train.validation_interval": 10,
    "train.num_gpus": 1,
    "dataset.sequences.split_num": 90,
    "dataset.train_dataset.sequences_split_num": 90,
    "dataset.data_root": f"{S3_TRAIN}/train",
    "model.head.instance_bank.anchor": f"{CONVERTED}/anchor_init.npy",
    "dataset.train_dataset.ann_file": f"{CONVERTED}/train/{CONVERTED_SCENE}_infos_train.pkl",
    "dataset.val_dataset.ann_file": f"{CONVERTED}/val/{CONVERTED_SCENE}_infos_val.pkl",
    "dataset.test_dataset.ann_file": f"{CONVERTED}/test/{CONVERTED_SCENE}_infos_test.pkl",
}

evaluate (mandatory data sources):

CONVERTED = "s3://bucket/results/<dataset_convert_job_id>"
{
    "dataset.data_root": f"{S3_EVAL}/val",
    "model.head.instance_bank.anchor": f"{CONVERTED}/anchor_init.npy",
    "dataset.train_dataset.ann_file": f"{CONVERTED}/train/{CONVERTED_SCENE}_infos_train.pkl",
    "dataset.val_dataset.ann_file": f"{CONVERTED}/val/{CONVERTED_SCENE}_infos_val.pkl",
    "dataset.test_dataset.ann_file": f"{CONVERTED}/test/{CONVERTED_SCENE}_infos_test.pkl",
}

export (mandatory data sources):

CONVERTED = "s3://bucket/results/<dataset_convert_job_id>"
{
    "model.head.instance_bank.anchor": f"{CONVERTED}/anchor_init.npy",
}

inference (mandatory data sources):

CONVERTED = "s3://bucket/results/<dataset_convert_job_id>"
{
    "dataset.data_root": f"{S3_EVAL}/test",
    "model.head.instance_bank.anchor": f"{CONVERTED}/anchor_init.npy",
    "dataset.train_dataset.ann_file": f"{CONVERTED}/train/{CONVERTED_SCENE}_infos_train.pkl",
    "dataset.val_dataset.ann_file": f"{CONVERTED}/val/{CONVERTED_SCENE}_infos_val.pkl",
    "dataset.test_dataset.ann_file": f"{CONVERTED}/test/{CONVERTED_SCENE}_infos_test.pkl",
}

quantize (mandatory data sources):

CONVERTED = "s3://bucket/results/<dataset_convert_job_id>"
{
    "dataset.data_root": f"{S3_TRAIN}/train",
    "model.head.instance_bank.anchor": f"{CONVERTED}/anchor_init.npy",
    "dataset.train_dataset.ann_file": f"{CONVERTED}/train/{CONVERTED_SCENE}_infos_train.pkl",
    "dataset.val_dataset.ann_file": f"{CONVERTED}/val/{CONVERTED_SCENE}_infos_val.pkl",
    "dataset.test_dataset.ann_file": f"{CONVERTED}/test/{CONVERTED_SCENE}_infos_test.pkl",
    "dataset.quant_calibration_dataset.images_dir": f"{S3_TRAIN}",
}
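
The override blocks above share most of their keys and differ mainly in dataset.data_root. A hypothetical helper such as `build_data_overrides` (the name and signature are illustrative, not part of the skill) can assemble the mandatory data-source keys for one action. It reproduces the path patterns from the examples above and leaves training hyperparameters such as train.num_epochs to the caller.

```python
def build_data_overrides(action, train_root, eval_root, converted, scene):
    """Return the mandatory data-source overrides for one action.

    train_root / eval_root correspond to S3_TRAIN / S3_EVAL in the examples,
    converted to CONVERTED, and scene to CONVERTED_SCENE.
    """
    anchor = {"model.head.instance_bank.anchor": f"{converted}/anchor_init.npy"}
    if action == "export":
        # Export only needs the anchor file.
        return anchor

    data_roots = {
        "train": f"{train_root}/train",
        "quantize": f"{train_root}/train",
        "evaluate": f"{eval_root}/val",
        "inference": f"{eval_root}/test",
    }
    if action not in data_roots:
        raise ValueError(f"unsupported action: {action}")

    overrides = {"dataset.data_root": data_roots[action], **anchor}
    for split in ("train", "val", "test"):
        overrides[f"dataset.{split}_dataset.ann_file"] = (
            f"{converted}/{split}/{scene}_infos_{split}.pkl"
        )
    if action == "quantize":
        # Calibration images come from the training data root.
        overrides["dataset.quant_calibration_dataset.images_dir"] = train_root
    return overrides
```

The result can be merged with action-specific settings before submission, for example `{**build_data_overrides("train", ...), "train.num_epochs": 30}`.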

See references/local_docker_conversion.md for local-docker conversion roots and mounts, H5 depth-path normalization, converted annotation filenames, smoke-run max_num_cams/anchor contracts for export compatibility, and converted-artifact verification before train/evaluate/inference.
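
For a quick pre-flight check of converted artifacts, a sketch like the one below lists the files that are still missing. It assumes the dataset_convert results are reachable on a local or mounted filesystem and follow the layout used in the override examples above (anchor_init.npy at the top level, one `<scene>_infos_<split>.pkl` per split folder). The function name `missing_converted_artifacts` is hypothetical, and the referenced document remains the authority on the full verification steps.

```python
from pathlib import Path


def missing_converted_artifacts(converted_dir: str, scene: str) -> list:
    """Return the expected converted files that do not exist yet."""
    root = Path(converted_dir)
    expected = [root / "anchor_init.npy"]
    expected += [
        root / split / f"{scene}_infos_{split}.pkl"
        for split in ("train", "val", "test")
    ]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means the anchor file and all three annotation pickles were found; anything else names the paths to regenerate with dataset_convert.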

Eval Dataset

Optional. Validation and test splits are configured through the dataset.val_dataset.ann_file and dataset.test_dataset.ann_file paths.

Important Parameters

Multi-GPU / Multi-Node

Launch method: Lightning-managed (single python process, Lightning spawns workers).

| Spec Key | Description | Default |
|---|---|---|
| train.num_gpus | Number of GPUs | 1 |
| train.gpu_ids | GPU device indices | [0] |
| train.num_nodes | Number of nodes | 1 |

Multi-node env vars (set by orchestrator): WORLD_SIZE, NODE_RANK, MASTER_ADDR, MASTER_PORT, NUM_GPU_PER_NODE.
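
A launcher or debugging script may want to confirm that the orchestrator actually set these variables before starting a multi-node run. The sketch below is one way to do that. `read_multinode_env` is a hypothetical name, and parsing every variable except MASTER_ADDR as an integer is an assumption about their format.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("WORLD_SIZE", "NODE_RANK", "MASTER_ADDR", "MASTER_PORT", "NUM_GPU_PER_NODE")


def read_multinode_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the multi-node variables, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing multi-node variables: {', '.join(missing)}")
    return {
        "world_size": int(env["WORLD_SIZE"]),
        "node_rank": int(env["NODE_RANK"]),
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "num_gpu_per_node": int(env["NUM_GPU_PER_NODE"]),
    }
```

Passing a plain dictionary instead of relying on os.environ keeps the check easy to exercise outside a real cluster.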

Hardware

Minimum 2 GPUs, 8 recommended, each with 40 GB or more of VRAM (A100 recommended). The multi-camera temporal model is memory intensive, and the instance bank needs substantial memory for temporal reasoning, so multi-GPU training is strongly recommended. bf16 is required for practical training.

Error Patterns

dataset_convert required: Must run dataset_convert first to produce annotation pickles and anchor_init.npy.

dataset_convert container/command: Sparse4D conversion is an AICity to OVPKL annotations conversion. Launch dataset_convert with the action-level tao_toolkit.data_services image and annotations convert -e {config_path}; do not use the PyTorch sparse4d CLI for conversion. Train/evaluate/export/inference still use the model-level PyTorch image.

Stable raw-data path: The AICity to OVPKL converter writes image paths into the generated pickle files. Keep aicity.root at /data/aicity_root during conversion, then point dataset.data_root at the split folder, for example /data/aicity_root/train for training or /data/aicity_root/val for evaluation. This preserves the converter’s absolute RGB paths and relative depth paths.

H5 depth tuple mismatch: If training fails with an H5 path error where the trainer tries to open a camera directory such as /data/aicity_root/train/<scene>/Camera, run models/sparse4d/scripts/normalize_depth_paths.py --data-root <host-aicity-root>/train <converted-ann-dir> after dataset_convert and before train/evaluate/inference. The helper rewrites converted depth_map_path tuples to point at <scene>/depth_maps/<camera>.h5 with the H5 dataset key basename.

Missing anchor file: Set model.head.instance_bank.anchor to the anchor_init.npy path from dataset_convert results.

Temporal OOM: Reduce dataset.num_frames or dataset.batch_size if running out of memory during temporal training.

Quantize image compatibility: The model-skill wiring should pass quantize.model_path through the parent-model resolver, and checkpoint handoff should select the exact epoch/step checkpoint just like evaluate, inference, export, and resume. TorchAO checkpoint quantization passes in the validation-fixes-20260525 PyT image and writes quantized_model_torchao.pth. Older 7.0.0-rc PyT images may fail inside the Sparse4D quantize entrypoint or lack ONNX quantization dependencies; do not remove or skip the advertised quantize action if that occurs. Report the container/image failure and keep the exact checkpoint path visible.
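
For the Temporal OOM pattern above, one possible retry policy is to shrink the batch first and the temporal window second. The sketch below is only an illustration of that idea: `reduce_for_oom` is a hypothetical helper, the order of reductions and the floors of 1 are assumptions, and the caller must supply the current dataset.batch_size and dataset.num_frames values because their defaults are not listed here.

```python
def reduce_for_oom(overrides: dict, min_batch_size: int = 1, min_num_frames: int = 1) -> dict:
    """Return a copy of the overrides with one memory-reducing step applied.

    Halves dataset.batch_size while it is above the floor, then decrements
    dataset.num_frames; raises when neither can be reduced further.
    """
    updated = dict(overrides)
    batch_size = updated["dataset.batch_size"]
    num_frames = updated["dataset.num_frames"]

    if batch_size > min_batch_size:
        updated["dataset.batch_size"] = max(min_batch_size, batch_size // 2)
    elif num_frames > min_num_frames:
        updated["dataset.num_frames"] = num_frames - 1
    else:
        raise RuntimeError("no further reduction possible; use GPUs with more memory")
    return updated
```

Each call applies a single step, so a retry loop can resubmit the job after every reduction and stop at the first run that fits in memory.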

Spec Param / Parent Model Inference

See references/spec_param_inference.md for the model-specific inference mappings from TAO Core sparse4d.config.json (the per-action spec-field to inference-function table) and the parent_model/parent_job_id checkpoint-resolution rules that generated runners apply with SDK helpers before create_job().

Skill frontmatter

license: Apache-2.0
compatibility: Requires docker + nvidia-container-toolkit.
metadata: {"version": "0.1.0", "author": "NVIDIA Corporation"}
allowed-tools: Read Bash
tags: temporal, 3d, detection, tracking