tao-finetune-cosmos-embed
Cosmos-Embed1 video-text embedding for text-to-video retrieval, video-to-video search, semantic deduplication, and fine-tuning. Use when the user asks to "fine-tune Cosmos-Embed1", "run cosmos-embed inference", "export Cosmos-Embed1", "embed videos", or "search videos with text".
Skill body
Cosmos-Embed
Cosmos-Embed1 is a joint video-text embedder for text-to-video retrieval, video-to-video search, zero-shot/kNN classification, and semantic deduplication. The packaged CLI is cosmos-embed1 and supports train, evaluate, inference, and export.
Container image and per-action commands are in references/skill_info.yaml. Compact starting specs are in references/spec_template_*.yaml.
Train Action Policy
AutoML is not packaged for this model skill because there are no Cosmos-Embed schemas under schemas/. Always use the direct model skill actions for train, evaluate, inference, and export, even when a higher-level request includes automl_policy: on. Do not route Cosmos-Embed through workflow or AutoML skills until model-specific train schemas and templates are added.
Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.
Quick Start
Use the published Cosmos-Embed container declared by references/skill_info.yaml
and resolved through versions.yaml. Do not build from the private
Cosmos-Embed1 source tree for normal skill use; build from source only when
developing the container itself.
TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
--skill-bank "$TAO_SKILL_BANK_PATH" \
--model tao-finetune-cosmos-embed \
--action train \
--format json |
python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"
Expected local workspace layout:
workspace/
├── data/
│ ├── msrvtt_test_1k.json
│ └── video/
│ ├── video7020.mp4
│ └── ...
├── model/
│ └── Cosmos-Embed1-224p/ # optional if using HF repo id
├── specs/
│ ├── train.yaml
│ ├── evaluate.yaml
│ ├── inference.yaml
│ ├── export_onnx.yaml
│ └── export_hf.yaml
└── results/
Use these Docker options for all actions unless the local Docker/platform skill gives a stricter environment-specific command:
TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
--skill-bank "$TAO_SKILL_BANK_PATH" \
--model tao-finetune-cosmos-embed \
--action train \
--format json |
python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
--rm --gpus all --ipc=host --network=host
--shm-size=64g
--ulimit memlock=-1
--ulimit stack=67108864
-e HF_TOKEN
-e WANDB_DISABLED=true
-e WANDB_MODE=disabled
-e HUGGINGFACE_HUB_CACHE=/hf_cache
-v "$RUN_ROOT/data:/data:ro"
-v "$RUN_ROOT/model:/model"
-v "$RUN_ROOT/specs:/specs:ro"
-v "$RUN_ROOT/results:/results"
-v "$RUN_ROOT/hf_cache:/hf_cache"
)
For Cosmos-Embed images that ship protobuf==7.x, run a small startup
preamble before every action:
python -m pip install "protobuf<7"
The image contains wandb==0.21.0 with protobuf==7.x; importing W&B fails before training/evaluation unless protobuf is pinned below 7. Use WANDB_DISABLED=true and WANDB_MODE=disabled for smoke or offline runs. Cosmos-Embed may still download the public google-bert/bert-base-uncased Q-Former component even when the model checkpoint is disabled, so pass HF_TOKEN as an environment variable or mount a persistent HuggingFace cache. Do not write the token into specs, logs, or reports.
Train:
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
bash -lc "python -m pip install 'protobuf<7' && cosmos-embed1 train -e /specs/train.yaml results_dir=/results"
Evaluate:
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
bash -lc "python -m pip install 'protobuf<7' && cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results"
Inference:
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
bash -lc "python -m pip install 'protobuf<7' && cosmos-embed1 inference -e /specs/inference.yaml \
'inference.query.input_texts=[\"a man is singing on stage\"]' \
inference.k=5 \
results_dir=/results"
Export ONNX:
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
bash -lc "python -m pip install 'protobuf<7' && cosmos-embed1 export -e /specs/export_onnx.yaml \
export.checkpoint=/results/train/checkpoints/iter_000000001.pt \
export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
results_dir=/results"
Export HuggingFace format:
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
bash -lc "python -m pip install 'protobuf<7' && cosmos-embed1 export -e /specs/export_hf.yaml \
export.checkpoint=/results/train/checkpoints/iter_000000001.pt \
export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
results_dir=/results"
Smoke Overrides
For a small functional check, keep the same specs and override the expensive knobs:
train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
train.optim.warmup_steps=0
train.optim.lr_decay_iters=1
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0
When shortening the cosine scheduler for smoke runs, keep
train.optim.lr_decay_iters greater than train.optim.warmup_steps, or set
train.optim.warmup_steps=0 as shown above. The scheduler divides by
lr_decay_iters - warmup_steps, so equal values fail before the checkpoint is
written.
If no local Cosmos-Embed1 pretrained checkpoint is available, set model.pretrained_model_path=null for a plumbing-only smoke train. The model quality is meaningless in that mode, but the train/evaluate/inference/export action paths can still be exercised. In the current container, the Q-Former path can still fetch google-bert/bert-base-uncased; provide HF_TOKEN or a mounted HuggingFace cache for fresh ephemeral containers.
For evaluation and inference smoke tests on a tiny subset:
evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0
Data Format
The MSR-VTT path expects a local video glob and a JSON metadata file:
dataset:
train_dataset:
dataset_type: msrvtt
mp4_urls: /data/video/*.mp4
metadata: /data/msrvtt_test_1k.json
List-format metadata rows must include at least video and caption:
{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}
The dataset loader derives the video id from the local .mp4 filename and filters to videos present in the metadata. If a run finds zero videos, check that mp4_urls points to a container-local glob and that metadata video names match the filenames.
Model Weights
- Local HF directory: mount it under
/modeland setmodel.pretrained_model_path=/model/Cosmos-Embed1-224p. - HuggingFace repo: set
model.pretrained_model_path=nvidia/Cosmos-Embed1-224pand passHF_TOKENif access is gated. - Fine-tuned checkpoint: set downstream actions to the resolver-selected
/results/train/checkpoints/iter_#########.ptfile.
Training writes full checkpoints under results/train/checkpoints/iter_#########.pt, updates results/train/checkpoints/latest_checkpoint.txt, and creates a cosmos_embed1_model_latest.pth symlink. For evaluate.checkpoint, inference.checkpoint, export.checkpoint, and train.resume_training_checkpoint_path, resolve and pass the exact iter_#########.pt file for the intended iteration. The action spec templates intentionally leave these checkpoint fields null so the model-skill runner or the user must provide the resolver-selected checkpoint. Use the latest symlink only when the user explicitly asks for latest.
For single-GPU resume/retrain from a consolidated checkpoint, set model.fsdp_shard_size: 1. The container default is 8, which sends resumed training through an FSDP apply path that Cosmos-Embed1 does not implement for this model class.
Variants:
| Variant | Resolution | Frames | Embedding dim |
|---|---|---|---|
Cosmos-Embed1-224p |
224 x 224 | 8 | 256 |
Cosmos-Embed1-336p |
336 x 336 | 8 | 768 |
Cosmos-Embed1-448p |
448 x 448 | 8 | 768 |
Keep model.network.embed_dim, model.input_hw, and model.network.spatial_resolution aligned with the selected variant.
Important Parameters
| Parameter | Notes |
|---|---|
train.num_gpus |
1 for single GPU, >1 auto-launches torchrun, -1 auto-detects visible GPUs. |
train.max_iter |
Main training length. Use 1 only for smoke testing. |
train.optim.optim |
fused_adamw is faster when available; adamw is safer for smoke and portability. |
model.lora.enabled |
Enables LoRA. Set model.network.visual_encoder.transformer_engine=false when LoRA is on. |
model.lora.lora_rank |
LoRA rank. Start with 8; try 4, 8, or 16 for manual or AutoML-style sweeps. |
model.lora.lora_alpha |
LoRA scaling factor. Start with 16; keep near 2 * lora_rank unless experiments show otherwise. |
model.lora.lora_dropout |
LoRA dropout. Start with 0.1; sweep 0.0, 0.05, and 0.1 for small datasets. |
model.lora.bias |
Bias policy: none, all, or lora_only. Keep none unless intentionally training biases. |
model.lora.use_rslora / use_dora |
Optional LoRA variants. Enable one at a time and record the setting with the checkpoint. |
model.lora.target_modules |
Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets. |
model.lora.modules_to_save |
Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head. |
evaluate.load_dataset_pkl / save_dataset_pkl |
Cache evaluation embeddings. |
inference.load_dataset_pkl / save_dataset_pkl |
Cache the search database for repeated retrieval. |
export.mode |
video, text, combined, or huggingface. |
export.on_cpu |
Recommended for export to avoid device mismatch issues. |
LoRA and AutoML Notes
For parameter-efficient fine-tuning, set model.lora.enabled=true and keep
model.network.visual_encoder.transformer_engine=false; TAO Core’s
Cosmos-Embed1 config notes that PEFT cannot inject adapters into Transformer
Engine layers. Treat the LoRA fields above as the first candidate parameters
for manual tuning or AutoML-style search before unfreezing larger model blocks.
Avoid changing target_modules or modules_to_save unless the user explicitly
needs custom adapter placement.
S3 Staging
The Cosmos-Embed1 CLI consumes local paths and Python globs, not raw s3://.../*.mp4 URIs. For S3-backed runs, first stage a subset or full dataset to the execution host/container filesystem, then use local paths such as /data/video/*.mp4 in the spec.
Recommended S3 layout for staged MSR-VTT data:
s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
├── video7020.mp4
└── ...
After downloading/syncing that prefix into the mounted data/ directory, use the same Docker commands above.
Outputs
results/
├── train/
│ ├── cosmos_embed1_model_latest.pth
│ ├── cosmos_embed1_model_<iter>.pth
│ └── experiment.yaml
├── evaluate/
│ ├── metrics.json
│ └── experiment.yaml
├── inference/
│ ├── results.json
│ └── experiment.yaml
├── export/
│ ├── cosmos_embed1_combined.onnx
│ └── export_config.yaml
└── export_hf/
└── cosmos_embed1_hf/
Known Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
MSRVTTDataset: 0 videos found |
mp4_urls is not a local glob or metadata filenames do not match videos. |
Mount data into the container and set mp4_urls=/data/video/*.mp4. |
| HF download/auth failure | Missing or invalid HF_TOKEN, or model agreement not accepted. |
Accept the model terms and pass -e HF_TOKEN. |
cannot import name 'Imports' from 'wandb.proto.wandb_telemetry_pb2' |
wandb==0.21.0 in the container is incompatible with protobuf==7.x. |
Run python -m pip install "protobuf<7" in the container before invoking cosmos-embed1. |
Resume fails with Model does not implement 'apply_fsdp' |
Single-GPU resume loaded a consolidated checkpoint while model.fsdp_shard_size stayed at the default 8. |
Set model.fsdp_shard_size=1 for local single-GPU resume/retrain. |
| LoRA injection failure | Transformer Engine visual encoder is enabled. | Set model.network.visual_encoder.transformer_engine=false. |
| ONNX/HF export complains about missing components | Export checkpoint is partial or adapter-only. | Use a full checkpoint or configure pretrained visual/text sources before export. |
| CUDA OOM | Batch/resolution too high for the GPU. | Reduce batch size, use 224p, enable LoRA, or use more GPUs. |