Agent Skill · NVIDIA NIM

tao-finetune-cosmos-reason

Cosmos3-Nano video QA supervised fine-tuning with FSDP parallelism. Use when training or evaluating video question-answering models, fine-tuning Cosmos3-Nano or compatible Cosmos Reason models with SFT/LoRA, or working with Cosmos-RL. Trigger phrases include "fine-tune Cosmos", "Cosmos3 Nano Reasoner", "Cosmos-RL SFT", "video QA fine-tune", "Cosmos3-Nano training".

Provider: NVIDIA NIM
Path in repo: skills/tao-finetune-cosmos-reason/SKILL.md

Skill body

Cosmos-RL

Supervised fine-tuning (SFT) of Cosmos Reason video QA models. The packaged default base model is hf_model://nvidia/Cosmos3-Nano. Pretrained weights are sourced from HuggingFace, not NGC. Gated HuggingFace models require HF_TOKEN. Some Cosmos-RL images cannot load the native Cosmos3 Omni checkpoint format directly; for those images, convert Cosmos3-Nano to a Qwen3-VL HF safetensors directory before train/evaluate and use that converted directory as the PTM path.

Cosmos-RL uses FSDP-based parallelism: dp_shard_size sets the number of GPUs per node and dp_replicate_size sets the node count, in place of the standard num_gpus/num_nodes.

Requests for “Cosmos Reason 3”, “Cosmos3 Nano Reasoner”, or nvidia/Cosmos3-Nano are handled by this skill. There is no separate Cosmos3 model directory in the skill bank; route those requests here. Override the base HuggingFace model only when the user explicitly asks for a different model.

Deep detail lives in references; load the smallest one that matches the task. The sections below name the reference file for launch, evaluation, AutoML, and parameters.

Dataclass Schemas

Generated TAO Core schemas are packaged in schemas/<action>.schema.json, with schemas/manifest.json listing available actions. Each generated schema also emits references/spec_template_<action>.yaml from the schema top-level default field. AutoML enablement is declared at the model layer in references/skill_info.yaml via automl_enabled. Runnable AutoML still requires schemas/train.schema.json and references/spec_template_train.yaml to exist and parse. Use the packaged train schema for automl_default_parameters, automl_disabled_parameters, defaults, min/max bounds, enums, option weights, math conditions, dependencies, and popular parameters. Do not expect ~/tao-core at runtime; maintainers regenerate schemas/templates before packaging the skill bank.
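As a minimal sketch of the "exist and parse" requirement, the check below looks for the two packaged train files under a skill directory. The function name automl_files_ready is hypothetical, and the sketch is stdlib-only: it parses the schema as JSON but only checks that the template exists and is non-empty, leaving the YAML parse to whatever loader the caller already uses.

```python
import json
from pathlib import Path


def automl_files_ready(skill_dir: Path) -> bool:
    """Return True when the packaged train schema and spec template look usable.

    Hypothetical helper, not part of the skill bank. The schema must parse as
    JSON; the template is only checked for existence and non-zero size.
    """
    schema = skill_dir / "schemas" / "train.schema.json"
    template = skill_dir / "references" / "spec_template_train.yaml"
    if not (schema.is_file() and template.is_file()):
        return False
    try:
        json.loads(schema.read_text())
    except json.JSONDecodeError:
        return False
    return template.stat().st_size > 0
```

A caller could run this once per train request and fall back to the "enabled but not runnable" report when it returns False.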

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user’s workflow request. Use automl_policy: on by default and only expose on / off in new launch prompts. Treat phrases like “turn off AutoML”, “disable AutoML”, “no HPO”, or “plain training” as automl_policy: off for this run only. When automl_policy: on, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model’s skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy. Use direct model training only when automl_policy: off or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as evaluate, inference, and quantize stay in this model skill. The per-run automl_policy override does not change model metadata.
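The routing rules in this section can be sketched as two small pure functions. Both function names are hypothetical, the phrase matching is a simple substring check chosen for illustration, and the branch for automl_enabled being false is an assumption because this section only describes an AutoML-enabled model.

```python
from typing import Optional, Tuple

# Opt-out phrases quoted in the policy text above, lower-cased for matching.
OPT_OUT_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


def resolve_automl_policy(user_request: str, explicit_policy: Optional[str] = None) -> str:
    """An explicit on/off value wins; otherwise opt-out phrases mean off; default is on."""
    if explicit_policy in ("on", "off"):
        return explicit_policy
    text = user_request.lower()
    return "off" if any(phrase in text for phrase in OPT_OUT_PHRASES) else "on"


def route_train_action(policy: str, automl_enabled: bool, files_ready: bool) -> Tuple[str, str]:
    """Return (route, note) for a train request."""
    if policy == "off":
        return ("direct", "AutoML turned off for this run only")
    if automl_enabled and files_ready:
        return ("tao-skill-bank:tao-run-automl", "")
    if automl_enabled:
        return ("direct", "AutoML is enabled but not runnable until schemas are generated")
    # Assumption: a model that is not AutoML-enabled trains directly.
    return ("direct", "")
```

Keeping the decision free of file or network access makes it easy to check each branch before wiring it to the real metadata in references/skill_info.yaml.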

Credentials

Cosmos3 Checkpoint Conversion

When a selected image cannot load the native Cosmos3 checkpoint format (model_type="cosmos3_omni" or Cosmos3ForConditionalGeneration), do not patch QwenVL, Transformers, or vLLM first. Use the upstream Cosmos Framework VLM conversion path to produce a Qwen3-VL HF safetensors directory, then point Cosmos-RL specs at that converted directory.

The model skill packages a helper:

python skills/models/tao-finetune-cosmos-reason/scripts/prepare_cosmos3_vlm_checkpoint.py \
  --checkpoint-path /abs/path/Cosmos3-Nano \
  --output-path /abs/path/Cosmos3-Nano-VLM \
  --secrets-env ~/.tao/secrets.env \
  --validate-with-image <cosmos-rl-image>

After conversion, use the converted directory consistently as the PTM:

train:    policy.model_name_or_path=/abs/path/Cosmos3-Nano-VLM
evaluate: model.model_name=/abs/path/Cosmos3-Nano-VLM
evaluate: model.base_model_path=/abs/path/Cosmos3-Nano-VLM

For local Docker, mount the converted directory read-only into the Cosmos-RL container and set the spec to the container path. If a converted copy already exists and validates, reuse it for PTM baseline evaluation, AutoML recommendations, and final best-checkpoint evaluation rather than converting again.
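A minimal sketch of applying the converted directory consistently, assuming the train and evaluate specs are already loaded as nested dicts. The helper name use_converted_ptm is hypothetical; the three keys it writes are the ones listed above.

```python
def use_converted_ptm(train_spec: dict, eval_spec: dict, ptm_path: str) -> None:
    """Point the train and evaluate specs at one converted checkpoint directory.

    For local Docker, ptm_path should be the path as seen inside the container.
    """
    train_spec.setdefault("policy", {})["model_name_or_path"] = ptm_path
    model = eval_spec.setdefault("model", {})
    model["model_name"] = ptm_path
    model["base_model_path"] = ptm_path


train_spec, eval_spec = {}, {}
use_converted_ptm(train_spec, eval_spec, "/abs/path/Cosmos3-Nano-VLM")
```

Writing all three keys in one place reduces the chance that train and evaluate drift onto different base models.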

Training Requirements

Spec construction

cosmos-rl is mode: config. Always start from the packaged references/spec_template_<action>.yaml for the requested action — load it as your base spec via yaml.safe_load(...) and apply user overrides on top. Don’t rebuild from scratch.

import yaml
from pathlib import Path

skill = Path.home() / "tao-sdk/tao-skills-external/skills/models/tao-finetune-cosmos-reason"
action = "train"  # train, evaluate, inference, or quantize
specs = yaml.safe_load((skill / f"references/spec_template_{action}.yaml").read_text())
# Now apply your overrides on top of `specs`.

The reference TOML (and the spec the model actually consumes) is nested dicts, not flat dotted keys. Dotted notation in override examples denotes paths into the nested spec — walk the path and assign at the leaf. See skills/platform/tao-run-platform/SKILL.md’s “spec is nested dicts” callout. Data source overrides are mandatory for every action.
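The "walk the path and assign at the leaf" rule can be sketched as follows. set_by_path is a hypothetical helper, not something the skill bank is known to ship; it creates missing intermediate dicts and refuses to descend through a non-dict value.

```python
def set_by_path(spec: dict, dotted: str, value) -> None:
    """Assign value at the nested location named by a dotted override path."""
    *parents, leaf = dotted.split(".")
    node = spec
    for key in parents:
        node = node.setdefault(key, {})
        if not isinstance(node, dict):
            raise TypeError(f"{key!r} in {dotted!r} is not a nested dict")
    node[leaf] = value


spec = {"train": {"train_policy": {"type": "sft"}}}
set_by_path(spec, "train.train_batch_per_replica", 32)
# spec["train"] now holds train_batch_per_replica alongside train_policy;
# no flat "train.train_batch_per_replica" key is created at the top level.
```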

The packaged template keeps custom.vision.nframes=8 for bounded 1-GPU memory; switch to fps only after checking token budget and GPU memory, and delete custom.vision.nframes from the spec when you do.
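A sketch of the nframes-to-fps switch on a nested spec. The helper name switch_to_fps is hypothetical, and the exact fps key under custom.vision is an assumption (it defaults to "fps" here and is exposed as a parameter); confirm the key against the packaged schema before relying on it.

```python
def switch_to_fps(spec: dict, fps: float, fps_key: str = "fps") -> None:
    """Replace frame-count sampling with fps sampling under custom.vision.

    Call this only after checking token budget and GPU memory. fps_key is an
    assumed key name, not one confirmed by the packaged template.
    """
    vision = spec.setdefault("custom", {}).setdefault("vision", {})
    vision.pop("nframes", None)  # nframes must not remain once fps is set
    vision[fps_key] = fps


spec = {"custom": {"vision": {"nframes": 8}}}
switch_to_fps(spec, 2.0)
```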

See references/cosmos-reason-launch.md for launch intake, the full check_tao_launch_preflight.py slurm/local-Docker examples, the video_fps preflight example, S3 staging, the GPU resource/architecture gate, the per-action dataset requirements table, the /workspace mount caveat, the quantize compatibility shim, and the full typical-overrides list.

Critical Overrides (Train)

These are the keys whose template defaults are wrong or where omission flips the run into a different mode:

| Parameter | Template Default | Required Value | Why |
| --- | --- | --- | --- |
| policy.model_name_or_path | hf_model://nvidia/Cosmos3-Nano | Direct Docker: nvidia/Cosmos3-Nano, hf_model://nvidia/Cosmos3-Nano, or a local HF snapshot path. SDK/managed platform predownload: hf_model://nvidia/Cosmos3-Nano. | Keep the train and evaluate base model aligned. |
| policy.model_max_length | 40960 | Keep at 40960 or higher | Smaller than ~40k causes vision_embeds shape mismatch on video inputs |
| train.train_batch_per_replica | 32 | Any multiple of train.train_policy.mini_batch | Mismatch raises an immediate AssertionError |
| train.train_policy.type | "sft" | Keep as "sft" for SFT workflows | If dropped during agent regeneration, cosmos-rl flips to RL mode → rollout replica allocated → multi-node attempted → hostname errors when num_nodes=1 |
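A pre-launch check for the last three rows of the table above could look like the sketch below. check_train_overrides is a hypothetical helper that inspects the final merged spec (template plus overrides) as nested dicts; it treats a missing key as a problem, which is a deliberately strict assumption.

```python
from typing import List

MIN_MODEL_MAX_LENGTH = 40960  # threshold taken from the table above


def check_train_overrides(spec: dict) -> List[str]:
    """Return human-readable problems found in a merged, nested train spec."""
    problems = []
    policy = spec.get("policy", {})
    train = spec.get("train", {})
    train_policy = train.get("train_policy", {})

    max_len = policy.get("model_max_length")
    if not isinstance(max_len, int) or max_len < MIN_MODEL_MAX_LENGTH:
        problems.append("policy.model_max_length must be 40960 or higher")

    batch = train.get("train_batch_per_replica")
    mini = train_policy.get("mini_batch")
    divisible = (
        isinstance(batch, int)
        and isinstance(mini, int)
        and mini > 0
        and batch % mini == 0
    )
    if not divisible:
        problems.append(
            "train.train_batch_per_replica must be a multiple of "
            "train.train_policy.mini_batch"
        )

    if train_policy.get("type") != "sft":
        problems.append('train.train_policy.type must stay "sft" for SFT workflows')
    return problems
```

An empty list means the three checks passed; anything else is worth fixing before submitting, since the batch mismatch fails immediately and a dropped type silently changes the run mode.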

Evaluate

The evaluator reads a flat TOML config (dataset, model, task, evaluation, vision, generation, metrics, results, num_gpus, results_dir); the actions.evaluate block in references/skill_info.yaml declares inputs and outputs. See references/cosmos-reason-evaluate.md for the flat-TOML config detail, task types ("" General Evaluator vs "its_directionality"), LoRA evaluation via spec_overrides, selective download, results/metrics, and the datasets section.
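Because the evaluate config uses different top-level names from the train spec (model rather than policy, for example), a quick key check can catch a train-style spec passed to the evaluator. This is a sketch only: the helper name is hypothetical, and treating any key outside the list above as unexpected is an assumption rather than documented evaluator behavior.

```python
from typing import List

# Top-level names listed for the evaluate config in the paragraph above.
EVALUATE_TOP_LEVEL = {
    "dataset", "model", "task", "evaluation", "vision",
    "generation", "metrics", "results", "num_gpus", "results_dir",
}


def unexpected_evaluate_keys(config: dict) -> List[str]:
    """Return top-level keys that are not in the documented evaluate list."""
    return sorted(key for key in config if key not in EVALUATE_TOP_LEVEL)


config = {"model": {"model_name": "/abs/path/Cosmos3-Nano-VLM"}, "policy": {}}
leftovers = unexpected_evaluate_keys(config)  # "policy" is a train-spec key
```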

AutoML / HPO Notes

The packaged default base model is hf_model://nvidia/Cosmos3-Nano; apply it consistently to train (policy.model_name_or_path) and post-training evaluation (model.base_model_path) unless the user provides a different model. See references/cosmos-reason-automl.md for accuracy-vs-val/avg_loss objective selection, the eval_fn per-recommendation evaluate flow, the knob mapping (learning rate, batch size, epochs, weight decay, warmup ratio), example custom_param_ranges, train_sample_count batch-size capping, ordered_int requirements, and the pre-launch recommendation summary.

Parameters, Hardware, Errors, DEFT, Inference

For parallelism, set policy.parallelism.dp_shard_size to the number of GPUs per node and policy.parallelism.dp_replicate_size to the node count (1 for single node). Cosmos-RL handles distributed init internally via FSDP and does not rely on platform-level MASTER_ADDR/WORLD_SIZE; submit with gpu_count=<gpus_per_node> and num_nodes=<N>, and the spec keys drive sharding. Cosmos-RL models have 8B parameters; the recommended hardware is 8x A100 or H100 (80GB each).
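The parallelism mapping can be sketched as one helper that writes the two spec keys and returns the matching submit arguments. set_parallelism is a hypothetical name, and returning the submit arguments as a dict is an illustrative choice, not a platform API.

```python
def set_parallelism(spec: dict, gpus_per_node: int, num_nodes: int = 1) -> dict:
    """Write the FSDP sizing keys and return the matching submit arguments."""
    if gpus_per_node < 1 or num_nodes < 1:
        raise ValueError("gpus_per_node and num_nodes must be positive")
    parallelism = spec.setdefault("policy", {}).setdefault("parallelism", {})
    parallelism["dp_shard_size"] = gpus_per_node   # GPUs per node
    parallelism["dp_replicate_size"] = num_nodes   # 1 for single node
    return {"gpu_count": gpus_per_node, "num_nodes": num_nodes}


spec = {}
submit_args = set_parallelism(spec, gpus_per_node=8, num_nodes=1)
```

Deriving both the spec keys and the submit arguments from the same two numbers keeps them from disagreeing.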

See references/cosmos-reason-parameters.md for important parameters (training loop, model/policy, parallelism incl. multi-node FSDP, optimization, vision encoders, checkpointing incl. the best symlink/epoch_* resolution, validation, logging), hardware sizing, the full error-pattern catalog (CUDA OOM, LoRA-eval OOM, NaN loss, vision_embeds mismatch, quantize token mismatch, batch-size divisibility and per-rank limits, stale cache, scheduler-None, gated-repo HF_TOKEN, GPU resource/architecture gate, status-logging warnings), DEFT support and scripts/analyze_gaps.py gap analysis, and the parent-model inference mapping table.

Skill frontmatter

license: Apache-2.0
compatibility: Requires docker + nvidia-container-toolkit.
metadata: {"author": "NVIDIA Corporation", "version": "0.1.0"}
allowed-tools: Read Bash
tags: videoqa cosmos sft reasoning vlm