tao-finetune-huggingface-model
Fine-tune any HuggingFace CV / VLM / LLM model on local NVIDIA GPUs inside an NGC PyTorch container. Use when the user wants to fine-tune a HuggingFace model (full or LoRA), train a vision / VLM / LLM model end-to-end, generate a reproducible HF training pipeline, smoke-test a HuggingFace model locally before scale-up, push a fine-tuned model to the HF Hub with a model card, or emit a self-contained rerun skill for an existing HuggingFace finetune. Supports image classification, object detection, semantic / instance / panoptic segmentation, depth estimation, image-text-to-text VLM (SFT / LoRA), and LLM SFT / DPO / GRPO. Six-step workflow: inspect and qualify, hardware and NGC image, research, generate and smoke, train + eval + infer, push and emit rerun skill.
Skill body
tao-finetune-huggingface-model
Local NVIDIA GPU fine-tuning for HuggingFace models, grounded in live-fetched documentation with curated references as a fallback safety net. One NGC container, a few focused scripts, one push to HF Hub. Follow the rules in this file; don’t improvise.
Order of authority (highest first):
- User input — explicit
model_id,dataset_id,training_method,config.yamloverrides. - Live research — model card, HF repo example, author finetune script, HF task docs, paper; always fetched (Step 3 +
references/research-priorities.md). - Curated references (
references/*.md) — fallback when live research is silent/ambiguous. - Your training-data memory — last resort; suspect, cross-check against (2)/(3).
Conflict resolution between (2) and (3) and the source-line discrepancy note are
in references/research-priorities.md.
Inputs
Required:
model_id— HuggingFace model ID, e.g.google/vit-base-patch16-224
Conditional credentials (read from the session environment, exported before launching when present):
HF_TOKEN— only when the model/dataset is gated (read) orpush_to_hubis on (write); public + public +push_to_hub: falseneeds none. Value never read — presence-only via[ -n "$HF_TOKEN" ].WANDB_API_KEY,WANDB_PROJECT— only when WandB is enabled;WANDB_MODE=disabledopts out.
Dataset — exactly one:
dataset_id— HuggingFace dataset ID (source:hf)local_dataset_path— local folder or file (source:local); optionallocal_dataset_format∈ {auto, imagefolder, coco, voc, jsonl, arrow, parquet, csv} (default: auto-detect).- (omit) — agent recommends popular datasets (source:
recommend)
Optional (have defaults):
task_type— auto-detected from config + model cardn_train=10000,n_eval=1000,n_epochs=3,lora_r=16output_dir=./output/<model_short_name>hf_model_repo— push target; if unset and HF_TOKEN has write access, auto-derived as<whoami>/<model_short_name>-finetuned.push_to_hub=True— set toFalseto skipskip_baseline=False— skip zero-shot baseline eval
Optional deliverables (off by default):
emit_progress_log: false # output_dir/PROGRESS.md (per-step journal)
emit_report: false # reports/report.{pdf,html} with curves & samples
emit_unit_tests: false # tests/ with fake-data heterogeneous-batch tests
All values live in output_dir/config.yaml. Never hardcode in Python.
Execution platform
This skill orchestrates what to run; the platform skills own how to run it on a GPU host — read them first.
| Concern | Authoritative skill |
|---|---|
| GPU host runtime (driver 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0) | tao-skill-bank:tao-setup-nvidia-gpu-host |
docker run flags, NGC auth, mounts, env passthrough |
tao-skill-bank:tao-run-on-docker |
| Local Docker job preflight (daemon, GPU smoke) | tao-skill-bank:tao-run-on-local-docker |
Default platform: local-docker — build a one-off image (run-<short>:latest)
and run it on the local Docker daemon. Ask only when the user explicitly needs a
different backend (Brev remote GPU, SLURM/Kubernetes); then run that platform’s
Preflight first and route the Steps 4–5 docker run commands through it. The
GPU-runtime and presence-only credential preflights (values never read), the
canonical docker run flag set, the list_tao_platforms.py selection command, and
the workflow-specific flags (--entrypoint /bin/bash -lc, PYTORCH_CUDA_ALLOC_CONF,
--name hft_train) are in references/workflow-intake-preflight.md.
References — fallback safety net
Consulted only when live research is silent, ambiguous, or unavailable; live
docs always win for the specific model and current API. Each step links the
references it needs; full catalog in references/detailed-workflow.md.
Always-on: core-rules.md, error-playbook.md, compat-workarounds.md,
model-discovery.md, dataset-recommendations.md, dataset-sources.md,
dataset-patterns.md, hardware-container.md, research-priorities.md,
cv-scripts.md, vlm-scripts.md, docker-runs.md, hub-push.md,
pipeline-skill-template.md, deliverables.md. Opt-in (when their flag/need
applies): progress-tracking.md, testing.md, reporting.md,
workflow-intake-preflight.md, workflow-generate-train.md, workflow-push-rerun.md.
Rule: before falling back, log the live source you tried and why it was
insufficient (config.yaml notes:, and PROGRESS.md if enabled). [FETCH LIVE]
markers in cv-scripts.md / vlm-scripts.md are a research checklist, not code to
inline — refetch the listed URL if a block has no Step 3 finding.
Core rules
Non-negotiable behaviors. Short version (full enumeration —
hallucinated-imports list, never-without-approval list, full error-recovery and
hardware-sizing tables — in references/core-rules.md, consult before any
training-time decision):
- Your HF-library knowledge is outdated. Fetch live docs (model card, HF repo example, task doc) before writing any ML code — don’t generate trainer args / collator / transforms from memory (Step 3).
- Smoke-test on real data with
--max_steps 1before any full run; no batch launches without a verified smoke. - Never silently substitute model_id, dataset_id, or training_method — if what the user asked for doesn’t load, stop and ask.
- Error recovery is minimal-change. OOM → halve batch, double grad_accum, enable gradient checkpointing (no LoRA switch without approval); NaN → reduce LR 10×; flat loss → inspect collator; same error 3× → stop and ask. Don’t loop.
- Dataset columns verified BEFORE the collator — rename in
prepare_data.py; restructuring needed → stop and ask. - Hardware-sizing thumb (bf16): ≤3B → 24 GB, 7–13B → 80 GB, 30B+ → multi-GPU or LoRA on 1× 80 GB, 70B+ → 8× 80 GB or LoRA. Full finetune won’t fit and no LoRA requested → ask before switching.
Workflow — 6 steps
Single pass, sequential; each step has a clear gate before the next begins.
Step 1 — Inspect & qualify
Goal: decide whether to proceed. Probe model + dataset, apply accept/reject,
register applicable compat fixes, write the initial config.yaml.
Prerequisites: MODEL_ID, optional DATASET_ID / local_dataset_path,
optional HF_TOKEN, OUTPUT_DIR (default ./output/<model_short_name>). Probes
run in a CPU-only python:3.12-slim Docker container (bind-mounted .probe/
scratch) so the host needs no virtualenv — Docker must exist first. Docker-presence
guard, container env, full probe invocation, and the model/dataset probe scripts
are in references/workflow-intake-preflight.md, references/model-discovery.md,
and references/dataset-sources.md.
Probe requirements:
- Model: load
AutoConfig, read model-card tags, detect task fromarchitectures+ tags + card examples (fallback logging inmodel-discovery.md). - Dataset: for recommended datasets, first present 3-5 choices from
dataset-recommendations.md; for local data, bind-mount read-only and usedataset-sources.mdformat detection. - Reject early if the model config fails, the task is out of scope, no recipe source exists, or the dataset cannot load / match the task schema.
- Evaluate
compat-workarounds.mdagainst the model/task; defer hardware-dependent rules to Step 2.
Write the initial config.yaml (model_id, task, dataset_id or
local_dataset_path, research_sources: [] filled in Step 3,
applicable_workarounds: from Step 1, notes: [] for reference fallbacks,
push_to_hub: true default — annotated template in
references/workflow-intake-preflight.md). Optionally rm -rf "$OUTPUT_DIR/.probe"
once the gate is met.
Gate: config.yaml exists with model, dataset, task, applicable_workarounds;
do not proceed if any field is missing.
Step 2 — Hardware audit & NGC image
Goal: verify Docker + GPU + disk, pick the NGC PyTorch image live, finalize hardware-dependent compat rules.
2a. Audit (hard gate) — three checks (commands in
references/workflow-intake-preflight.md):
- GPU host runtime —
tao-setup-nvidia-gpu-host’ssetup-nvidia-gpu-host.sh --backend docker --check-only; on fail, ask approval then re-run with--install --yes. - Free-disk soft-warn — override via
MIN_DISK_GB(default 100 GB); recommend ≥ 100 GB for NGC base (~20 GB) + HF cache + checkpoints + data. - Conditional credential presence (from the session environment, values never
read) —
HF_TOKENonly when gated orpush_to_hubis on;WANDB_*only when WandB is on.
Do not proceed to Step 4 on a hard-fail — Step 4’s docker build pulls a
20+ GB NGC base, and a missing nvidia-container-toolkit only surfaces later as
could not select device driver "" with capabilities: [[gpu]]. Record gpu_count,
gpu_name, driver_major, vram_gb_per_gpu in config.yaml.
2b. Pick NGC image (live): from the NVIDIA Deep Learning Frameworks support
matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html),
PyTorch NGC container section, pick the highest-versioned image where
Min driver ≤ detected driver_major and container CUDA ≤ host CUDA Toolkit
(match closely so cuDNN / TensorRT line up). Do not reject an image for an
aN/bN/rcN PyTorch tag — NGC validates the full image; pick the newest
CUDA-aligned one and let compat-workarounds.md handle per-version issues. If the
matrix is unreachable, use the fallbacks in references/hardware-container.md;
default nvcr.io/nvidia/pytorch:24.09-py3 (driver ≥ 545; SDPA+GQA bug — if
num_key_value_heads < num_attention_heads, set attn_implementation: "eager").
Record ngc_image in config.yaml.
2c. Re-evaluate hardware-dependent compat rules: re-run the
compat-workarounds.md walk for entries whose detect needs hw; update
applicable_workarounds: in place.
2d. Model-fit check: estimate param_bytes ≈ 2×param_count (bf16); if
60% of
vram_gb_per_gpu × 1e9, recommend LoRA in the user-facing summary.
Gate: config.yaml has ngc_image, gpu_count, gpu_name, driver_major,
vram_gb_per_gpu; hardware-dependent compat fixes recorded.
Step 3 — Research the recipe
Goal: fetch the live recipe — training-data knowledge of
transformers/trl/peft is suspect, so Step 3 is non-negotiable. Walk
references/research-priorities.md in priority order (Priority 1 → 6); stop once
you have, for the detected task:
AutoModel/ processor class- Train + eval transforms
- Collator
compute_metrics- Hyperparameter hints (LR, batch size, epochs, scheduler)
Record findings in meta/recipe.md, append source URLs to
config.yaml: research_sources:. A slot with no live finding falls back to the
matching scaffold (cv-scripts.md / vlm-scripts.md), logged as “fallback to
scaffold — no live source for
Gate: every required slot filled, with a source URL or scaffold-fallback note.
Step 4 — Generate project & smoke-test
Goal: write all scripts, build the image, prepare data, run a 1-step smoke on
real data (one docker build, two docker runs).
4a. Generate project files in output_dir/: config.yaml, Dockerfile,
requirements.txt, prepare_data.py, train.py, run_eval.py, infer.py,
optional merge_lora.py, optional tests/, .gitignore. Live Step 3 research is
authority; cv-scripts.md / vlm-scripts.md give scaffold shape only. Apply every
applicable_workarounds entry as a Dockerfile block, requirement pin, config
override, or runtime env var. Hard rules: run_eval.py keeps that exact filename
(avoids colliding with the HF evaluate package); every generated .py starts
with the NVIDIA Apache-2.0 copyright header and any emitter fails when it is
missing; emit_unit_tests: true generates and runs tests per
references/testing.md. Script bodies, Dockerfile shape, and the emitter contract
are in references/workflow-generate-train.md.
4b. Build, prepare, smoke — docker build -t run-<short>:latest ., then
prepare_data and the --smoke --max_steps 1 run (references/docker-runs.md
§1-3). Smoke pass criteria (in logs/smoke.log):
- No exception
- Loss is finite (not
0.0, notNaN) grad_norm > 0at step 1
If emit_unit_tests: true, also run pytest tests/ in the container. Any failure → STOP.
4c. Preflight summary — before full training, print and verify: reference URL, dataset columns, Hub target, monitoring target, NGC image, hardware, smoke loss/grad norm.
Gate: project files written, image built, smoke PASSED, preflight has no blank fields.
Step 5 — Train, evaluate, infer
Goal: baseline eval, full training, post-train eval, optional LoRA merge, 5
inference samples (all commands: references/docker-runs.md §4-8).
| Sub-step | docker-runs.md | Skip if |
|---|---|---|
| 5a. Baseline eval (zero-shot) | §4 | skip_baseline: true |
| 5b. Full training (detached) | §5 | — |
| 5c. LoRA merge | §6 | not VLM+LoRA |
| 5d. Post-train eval | §7 | — |
| 5e. Inference (5 samples) | §8 | — |
Multi-GPU: prepend torchrun --nproc_per_node=$gpu_count to python train.py.
While training streams, watch docker logs -f hft_train: loss should drop within
10-20 steps; flat loss (collator/label-masking bug), NaN (LR too high), and OOM
all stop the run — recovery in references/core-rules.md. If emit_report: true,
run report.py after Step 5e per references/reporting.md.
Gate: all of:
checkpoints/final/(orcheckpoints/merged/for LoRA) existsreports/eval_results.jsonhas a numeric primary metricreports/baseline_results.jsonexists (unless skipped)reports/inference_samples/has 5 samples- wandb URL shows descending loss
Step 6 — Push & emit rerun skill
Goal: publish the run and make it reproducible without re-research.
Push per references/hub-push.md (weights, model card, eval/baseline JSONs,
config.yaml, Dockerfile, requirements.txt, inference samples, reports when
emitted) unless push_to_hub: false is explicit. Emit
<output_dir>/skills/run-<short>/SKILL.md from
references/pipeline-skill-template.md — substitute every placeholder, include
full YAML metadata + the NVIDIA copyright HTML comment, and make any emitter fail
if those are missing.
Gate (Done criteria): all of:
- Step 5 gate met
- HF Hub repo exists at the resolved URL with weights + card +
results/(unlesspush_to_hub: false) <output_dir>/skills/run-<short>/SKILL.mdexists, no<placeholder>left, with metadata + copyright HTML comment perpipeline-skill-template.md
Final message: wandb URL, HF Hub URL, baseline -> fine-tuned primary metric,
reports/inference_samples/, and the rerun skill path.
Error playbook
On a known runtime error, consult the symptom → minimal-fix table in
references/error-playbook.md (NGC entrypoint, PyTorch/Transformers regressions,
numpy ABI, Albumentations bbox, PEFT/checkpointing, LoRA target breadth, CV
augmentation gaps, OOM at step 0) before redesigning anything. When a row there
fires twice across runs, lift it into compat-workarounds.md with a detect rule
— auto-applied in Step 1 before the error can fire.
Communication style
- Terse. No filler, no restating the request; one-word answers when appropriate.
- Always include direct Hub and wandb URLs when referencing artifacts.
- On error: state what went wrong, why, what you changed — no menus.
- Never present “Option A/B/C” for a request with a clear answer. Act.