Wuji Hand Gesture VAM TI2V5B 30L OpenWAM FastWAM F33 R4 H32 Ref65 Refdrop10 Mask V2 Propdrop10

This repository contains one Wuji hand gesture VAM checkpoint from the May 26, 2026 OpenWAM/FastWAM sweep.

Identity

repo_id:              knightnemo/wuji-hand-gesture-vam-ti2v5b-30l-openwam-fastwam-f33-r4-h32-ref65-refdrop10-mask-v2-propdrop10
wandb project:        wuji_hand_gesture
wandb run id:         s4827q9y
wandb run name:       wuji_hand_gesture_vam_ti2v5b_30L_mask_v2_openwam_refdrop_0.1_propdrop_0.1_0526_1545
local training dir:   /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_f33_r4_h32_ref65_refdrop10_mask_v2_propdrop10
checkpoint:           step-10000.safetensors
checkpoint size:      12041754617 bytes
base model:           Wan-AI/Wan2.2-TI2V-5B
action expert style:  openwam
mask variant:         v2
action horizon:       32
proprio dropout:      0.1
reference dropout:    0.1

The checkpoint is a joint model: step-10000.safetensors contains the fine-tuned video DiT weights plus the action_dit.* action-stream weights. The Wan2.2-TI2V-5B base model is not included.
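Because both weight groups live in one file, it can be useful to separate them by name before loading. The following is a minimal sketch, assuming that every action-stream tensor name starts with the `action_dit.` prefix mentioned above and that all remaining names belong to the video DiT. The helper names `split_joint_keys` and `list_checkpoint_keys` are hypothetical and not part of the training code.

```python
from typing import Dict, Iterable, List

ACTION_PREFIX = "action_dit."  # prefix named in this card


def split_joint_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Partition tensor names into action-stream and remaining (video DiT) groups."""
    groups: Dict[str, List[str]] = {"action": [], "video": []}
    for name in keys:
        bucket = "action" if name.startswith(ACTION_PREFIX) else "video"
        groups[bucket].append(name)
    return groups


def list_checkpoint_keys(path: str) -> List[str]:
    """Read tensor names from a safetensors file without loading the weights."""
    # Lazy import: needs the safetensors package (and PyTorch for framework="pt").
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

Calling `split_joint_keys(list_checkpoint_keys("step-10000.safetensors"))` on a downloaded copy would then show how many tensors fall into each group.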

Files

step-10000.safetensors   final 10k-step checkpoint
model_config.json        compact machine-readable configuration and metrics
training_config.yaml     full W&B training config snapshot
wandb-summary.json       final scalar metrics exported by W&B
training_log_node0.txt   node-0 training log
README.md                this model card
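One possible way to fetch the checkpoint and confirm that the download is complete is sketched below. It uses `hf_hub_download` from `huggingface_hub` and compares the file size with the byte count listed under Identity. The helper names `size_matches` and `fetch_checkpoint` are hypothetical, and the sketch assumes the repository is readable with your current credentials.

```python
import os

# Values copied from the Identity section of this card.
REPO_ID = (
    "knightnemo/wuji-hand-gesture-vam-ti2v5b-30l-openwam-fastwam-"
    "f33-r4-h32-ref65-refdrop10-mask-v2-propdrop10"
)
CHECKPOINT = "step-10000.safetensors"
EXPECTED_BYTES = 12041754617


def size_matches(num_bytes: int) -> bool:
    """Return True if a byte count equals the checkpoint size listed on this card."""
    return num_bytes == EXPECTED_BYTES


def fetch_checkpoint() -> str:
    """Download the checkpoint (about 12 GB) and return its local path."""
    # Lazy import so the helpers above work without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=CHECKPOINT)
    if not size_matches(os.path.getsize(path)):
        raise ValueError(f"unexpected file size for {path}")
    return path
```

The same call with `filename="model_config.json"` or `filename="training_config.yaml"` retrieves the small configuration files without the weights.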

Final Step Metrics

These are the scalar values in wandb-summary.json at step=10000.

| Metric | Value |
| --- | --- |
| `val/loss` | 0.334521 |
| `val/loss_action` | 0.272516 |
| `val/loss_video` | 0.0620054 |
| `val/action_mse` | 0.0499072 |
| `val/action_mae` | 0.149182 |
| `val/video_mse` | 98.6488 |
| `val/video_psnr` | 30.2544 |
| `val/video_ssim` | 0.961069 |
| `val/video_lpips` | 0.0246076 |
| `train/loss` | 0.00823434 |
| `train/loss_action` | 0.00358793 |
| `train/loss_video` | 0.00464641 |
| `train/grad_norm` | 0.953125 |
| `_runtime` | 15222 (seconds) |

Best saved checkpoint by aggregate validation loss during the run:

step 4000: loss=0.173965, loss_video=0.056327, loss_action=0.117638

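Only step-10000.safetensors is uploaded here, so treat the step-4000 entry as training provenance only, unless that checkpoint is uploaded separately. The final-step validation loss (0.334521) is roughly twice the step-4000 value (0.173965), and most of the gap comes from loss_action (0.272516 versus 0.117638).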
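The reported totals are consistent with a weighted sum of the two stream losses, using the `lambda_video` and `lambda_action` values of 1 from the training configuration. The sketch below recomputes each total from its components. The aggregation formula is an assumption inferred from the numbers on this card, not taken from the training code, and `weighted_total` is a hypothetical helper.

```python
import math

# Weights from the training configuration (lambda_video, lambda_action).
LAMBDA_VIDEO = 1.0
LAMBDA_ACTION = 1.0

# (label, total loss, video loss, action loss), copied from this card.
REPORTED = [
    ("val, step 10000", 0.334521, 0.0620054, 0.272516),
    ("train, step 10000", 0.00823434, 0.00464641, 0.00358793),
    ("val, step 4000", 0.173965, 0.056327, 0.117638),
]


def weighted_total(loss_video: float, loss_action: float) -> float:
    """Assumed aggregation: lambda-weighted sum of the two stream losses."""
    return LAMBDA_VIDEO * loss_video + LAMBDA_ACTION * loss_action


for label, total, loss_video, loss_action in REPORTED:
    recomputed = weighted_total(loss_video, loss_action)
    # Tolerance covers the rounding of the values printed on the card.
    consistent = math.isclose(recomputed, total, abs_tol=1e-6)
    print(f"{label}: reported={total:.6f} recomputed={recomputed:.6f} consistent={consistent}")
```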

Training Configuration

| Key | Value |
| --- | --- |
| `dataset_type` | `wuji_gesture` |
| `wuji_robot_dataset_root` | `/cpfs/huangsq/wuji-hand-gestures-cropped` |
| `variant` | `clean_50` |
| `height` | `256` |
| `width` | `256` |
| `num_frames` | `33` |
| `action_video_freq_ratio` | `4` |
| `action_horizon` | `32` |
| `action_dim` | `20` |
| `proprio_dim` | `20` |
| `action_format` | `absolute` |
| `action_space` | `joint` |
| `proprio_space` | `action` |
| `action_pad_mode` | `last` |
| `action_expert_style` | `openwam` |
| `action_mot_backbone_pretrained_path` | `/cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt` |
| `mask_variant` | `v2` |
| `mask_tail_padding_loss` | `True` |
| `full_reference_video` | `True` |
| `max_ref_frames` | `65` |
| `reference_dropout` | `0.1` |
| `bridge_exclude_full_ref` | `True` |
| `extra_inputs` | `vace_reference_image,action_trajectory` |
| `target_camera` | `head_camera` |
| `reference_camera` | `head_camera` |
| `resize_mode` | `pad` |
| `backbone` | `ti2v` |
| `model_paths` | `["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"]` |
| `tokenizer_path` | `models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl` |
| `trainable_models` | `dit` |
| `learning_rate` | `5e-05` |
| `action_lr` | `None` |
| `weight_decay` | `0.01` |
| `warmup_steps` | `500` |
| `max_steps` | `10000` |
| `num_epochs` | `1` |
| `batch_size` | `1` |
| `gradient_accumulation_steps` | `1` |
| `dataset_repeat` | `1` |
| `dataset_num_workers` | `8` |
| `use_gradient_checkpointing` | `True` |
| `save_steps` | `1000` |
| `val_steps` | `200` |
| `video_log_steps` | `1000` |
| `max_val_samples` | `20` |
| `lambda_video` | `1` |
| `lambda_action` | `1` |
| `video_dim` | `3072` |
| `action_dit_dim` | `1024` |
| `action_dit_ffn_dim` | `4096` |
| `action_dit_num_heads` | `24` |
| `action_dit_num_layers` | `30` |
| `proprio_dropout` | `0.1` |
| `window_stride` | `1` |
| `val_ratio` | `0.1` |
| `output_path` | `/cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_f33_r4_h32_ref65_refdrop10_mask_v2_propdrop10` |
| `wandb_project` | `wuji_hand_gesture` |
| `wandb_run_name` | `wuji_hand_gesture_vam_ti2v5b_30L_mask_v2_openwam_refdrop_0.1_propdrop_0.1_0526_1545` |

Input/Output Contract

Expected inputs:

prompt:                 "the robot performs hand gesture {label}"
target camera:          head_camera
reference camera:       head_camera
target video frames:    33
full reference frames:  65
image resolution:       256 x 256
action horizon:         32
action/proprio dim:     20 / 20

Expected outputs:

robot-view target video rollout
20-D absolute robot action targets
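The contract above can be turned into a small pre-flight check. The sketch below builds the prompt from the template and validates array shapes before they reach the model. Several details are assumptions rather than facts from the training code: the `(frames, height, width, channels)` layout, the `(32, 20)` action layout, a proprio vector whose last dimension is 20, and the reading of 65 reference frames as an upper bound (following the `max_ref_frames` key). The helper names and the example label are hypothetical.

```python
from typing import Tuple

Shape = Tuple[int, ...]

# Values copied from the Input/Output Contract on this card.
PROMPT_TEMPLATE = "the robot performs hand gesture {label}"
NUM_FRAMES = 33
MAX_REF_FRAMES = 65
HEIGHT, WIDTH = 256, 256
ACTION_HORIZON, ACTION_DIM, PROPRIO_DIM = 32, 20, 20


def build_prompt(label: str) -> str:
    """Fill the gesture label into the prompt template."""
    return PROMPT_TEMPLATE.format(label=label)


def check_shapes(video: Shape, reference: Shape, actions: Shape, proprio: Shape) -> None:
    """Raise ValueError if a shape disagrees with the contract.

    Assumed layouts: video and reference are (frames, H, W, C),
    actions are (horizon, dim), proprio ends with the proprio dimension.
    """
    if video[:3] != (NUM_FRAMES, HEIGHT, WIDTH):
        raise ValueError(f"target video shape {video} does not match the contract")
    if not 1 <= reference[0] <= MAX_REF_FRAMES or reference[1:3] != (HEIGHT, WIDTH):
        raise ValueError(f"reference video shape {reference} does not match the contract")
    if actions != (ACTION_HORIZON, ACTION_DIM):
        raise ValueError(f"action shape {actions} does not match the contract")
    if proprio[-1] != PROPRIO_DIM:
        raise ValueError(f"proprio shape {proprio} does not match the contract")


# Example with a hypothetical gesture label and contract-sized shapes.
prompt = build_prompt("thumbs_up")
check_shapes((33, 256, 256, 3), (65, 256, 256, 3), (32, 20), (20,))
```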

Masking Note

This run uses mask_variant=v2 with the sequence layout:

[ref_video | first_frame | gen_video | action]

The run config also records bridge_exclude_full_ref=True, which is kept here for provenance only. On this OpenWAM/ActionMoT path, the setting that distinguishes v2 from v3 masking is the mask_variant listed above.
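To make the layout concrete, the sketch below computes the index range each segment occupies in the concatenated sequence. Only the segment order comes from this card. The token counts are hypothetical placeholders, and the sketch does not model which segments may attend to which, since the card does not specify the v2 attention pattern.

```python
from typing import Dict

# Segment order as given in the sequence layout above.
SEGMENT_ORDER = ("ref_video", "first_frame", "gen_video", "action")


def segment_slices(lengths: Dict[str, int]) -> Dict[str, slice]:
    """Map each segment name to its slice in the concatenated token sequence."""
    slices: Dict[str, slice] = {}
    start = 0
    for name in SEGMENT_ORDER:
        end = start + lengths[name]
        slices[name] = slice(start, end)
        start = end
    return slices


# Hypothetical token counts, chosen only to illustrate the arithmetic.
example = segment_slices({"ref_video": 6, "first_frame": 2, "gen_video": 4, "action": 3})
for name in SEGMENT_ORDER:
    print(name, example[name].start, example[name].stop)
```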
