Wuji Hand Gesture VAM TI2V5B 30L OpenWAM FastWAM F33 R4 H32 Ref65 Refdrop10 Mask V3 Propdrop30

This repository contains one Wuji hand gesture VAM checkpoint from the May 26, 2026 OpenWAM/FastWAM sweep.

Identity

repo_id:              knightnemo/wuji-hand-gesture-vam-ti2v5b-30l-openwam-fastwam-f33-r4-h32-ref65-refdrop10-mask-v3-propdrop30
wandb project:        wuji_hand_gesture
wandb run id:         kigtihyd
wandb run name:       wuji_hand_gesture_vam_ti2v5b_30L_mask_v3_openwam_refdrop_0.1_propdrop_0.3_0526_1545
local training dir:   /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_f33_r4_h32_ref65_refdrop10_mask_v3_propdrop30
checkpoint:           step-10000.safetensors
checkpoint size:      12041754617 bytes (about 12.04 GB)
base model:           Wan-AI/Wan2.2-TI2V-5B
action expert style:  openwam
mask variant:         v3
action horizon:       32
proprio dropout:      0.3
reference dropout:    0.1

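The checkpoint is a joint model: step-10000.safetensors contains the fine-tuned video DiT weights plus the action_dit.* action-stream weights. The Wan2.2-TI2V-5B base model is not included, so the base files listed under model_paths in the training configuration (including the T5 text encoder and the VAE) must be obtained separately.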
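To see how the joint checkpoint splits between the two streams without loading 12 GB of tensors, the safetensors header can be read directly. The sketch below relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header) and on the action_dit.* prefix named above. The function names and the "video_dit" bucket label are illustrative, and the sketch assumes every tensor outside action_dit.* belongs to the video DiT.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def count_by_stream(header, action_prefix="action_dit."):
    """Count tensors under the action prefix versus all remaining tensors."""
    counts = Counter()
    for name in header:
        # Assumption: anything outside the action prefix is a video DiT weight.
        stream = "action" if name.startswith(action_prefix) else "video_dit"
        counts[stream] += 1
    return dict(counts)
```

Calling `count_by_stream(read_safetensors_header("step-10000.safetensors"))` on a local copy of the checkpoint would report how many tensors fall under each stream.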

Files

step-10000.safetensors   final 10k-step checkpoint
model_config.json        compact machine-readable configuration and metrics
training_config.yaml     full W&B training config snapshot
wandb-summary.json       final scalar metrics exported by W&B
training_log_node0.txt   node-0 training log
README.md                this model card

Final Step Metrics

These are the scalar values in wandb-summary.json at step=10000.

Metric              Value
val/loss            0.299837
val/loss_action     0.246712
val/loss_video      0.0531253
val/action_mse      0.0568796
val/action_mae      0.157043
val/video_mse       131.336
val/video_psnr      29.6155
val/video_ssim      0.958645
val/video_lpips     0.0298017
train/loss          0.0483059
train/loss_action   0.000125322
train/loss_video    0.0481806
train/grad_norm     0.660156
_runtime            15104.1 (seconds)

Best saved checkpoint by aggregate validation loss during the run:

step 9000: loss=0.203756, loss_video=0.048469, loss_action=0.155286

Only step-10000.safetensors is uploaded here, so treat the step-9000 entry as training provenance only, unless that checkpoint is also uploaded separately.
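The reported totals can be checked against their components. The sketch below assumes the aggregate loss is the weighted sum lambda_video * loss_video + lambda_action * loss_action, with both weights equal to 1 as recorded in the training configuration. That aggregation rule is an assumption the numbers happen to support, not something the logs state.

```python
# Loss terms copied from the metrics above: (total, video, action).
REPORTED = {
    "val @ step 10000": (0.299837, 0.0531253, 0.246712),
    "train @ step 10000": (0.0483059, 0.0481806, 0.000125322),
    "val @ step 9000": (0.203756, 0.048469, 0.155286),
}

LAMBDA_VIDEO = 1.0   # lambda_video from the training configuration
LAMBDA_ACTION = 1.0  # lambda_action from the training configuration


def weighted_total(video, action):
    # Assumed aggregation: a plain weighted sum of the two loss terms.
    return LAMBDA_VIDEO * video + LAMBDA_ACTION * action


def residuals(reported=REPORTED):
    """Absolute gap between each reported total and the recomputed sum."""
    return {
        name: abs(total - weighted_total(video, action))
        for name, (total, video, action) in reported.items()
    }


for name, gap in residuals().items():
    print(f"{name}: |reported - recomputed| = {gap:.1e}")
```

All three gaps sit at the rounding precision of the logged values. The same numbers show that the difference in validation loss between step 9000 and step 10000 comes mostly from the action term (0.155286 versus 0.246712), while the video term moves much less (0.048469 versus 0.0531253).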

Training Configuration

Key                                  Value
dataset_type                         wuji_gesture
wuji_robot_dataset_root              /cpfs/huangsq/wuji-hand-gestures-cropped
variant                              clean_50
height                               256
width                                256
num_frames                           33
action_video_freq_ratio              4
action_horizon                       32
action_dim                           20
proprio_dim                          20
action_format                        absolute
action_space                         joint
proprio_space                        action
action_pad_mode                      last
action_expert_style                  openwam
action_mot_backbone_pretrained_path  /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt
mask_variant                         v3
mask_tail_padding_loss               True
full_reference_video                 True
max_ref_frames                       65
reference_dropout                    0.1
bridge_exclude_full_ref              True
extra_inputs                         vace_reference_image,action_trajectory
target_camera                        head_camera
reference_camera                     head_camera
resize_mode                          pad
backbone                             ti2v
model_paths                          ["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"]
tokenizer_path                       models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl
trainable_models                     dit
learning_rate                        5e-05
action_lr                            None
weight_decay                         0.01
warmup_steps                         500
max_steps                            10000
num_epochs                           1
batch_size                           1
gradient_accumulation_steps          1
dataset_repeat                       1
dataset_num_workers                  8
use_gradient_checkpointing           True
save_steps                           1000
val_steps                            200
video_log_steps                      1000
max_val_samples                      20
lambda_video                         1
lambda_action                        1
video_dim                            3072
action_dit_dim                       1024
action_dit_ffn_dim                   4096
action_dit_num_heads                 24
action_dit_num_layers                30
proprio_dropout                      0.3
window_stride                        1
val_ratio                            0.1
output_path                          /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_f33_r4_h32_ref65_refdrop10_mask_v3_propdrop30
wandb_project                        wuji_hand_gesture
wandb_run_name                       wuji_hand_gesture_vam_ti2v5b_30L_mask_v3_openwam_refdrop_0.1_propdrop_0.3_0526_1545
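A few schedule quantities follow from the configuration values. The sketch below derives them, assuming that checkpoints and validation passes fire at every positive multiple of save_steps and val_steps. That trigger rule is an assumption about the trainer; the constants are copied from this card.

```python
# Values copied from the configuration and final-step metrics in this card.
MAX_STEPS = 10000
SAVE_STEPS = 1000
VAL_STEPS = 200
WARMUP_STEPS = 500
BATCH_SIZE = 1
GRAD_ACCUM = 1
RUNTIME_S = 15104.1  # _runtime from wandb-summary.json, in seconds


def multiples(interval, max_steps=MAX_STEPS):
    # Assumption: events fire at every positive multiple of the interval.
    return list(range(interval, max_steps + 1, interval))


save_points = multiples(SAVE_STEPS)
val_points = multiples(VAL_STEPS)

# Per-process samples per optimizer update; the device count is not recorded here.
samples_per_update = BATCH_SIZE * GRAD_ACCUM
warmup_fraction = WARMUP_STEPS / MAX_STEPS
# Wall-clock average over the whole run, not pure training-step time.
seconds_per_step = RUNTIME_S / MAX_STEPS

print(len(save_points), len(val_points), samples_per_update)
print(f"warmup {warmup_fraction:.0%}, {seconds_per_step:.2f} s/step")
```

Under that assumption the run produces 10 checkpoints and 50 validation passes, and step 9000 (the best saved checkpoint by validation loss) is both a save point and a validation point.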

Input/Output Contract

Expected inputs:

prompt:                 "the robot performs hand gesture {label}"
target camera:          head_camera
reference camera:       head_camera
target video frames:    33
full reference frames:  65
image resolution:       256 x 256
action horizon:         32
action/proprio dim:     20 / 20

Expected outputs:

robot-view target video rollout
20-D absolute joint-space action targets over a 32-step action horizon
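The contract above can be captured as a small pre-flight check before calling an inference pipeline. This is a minimal sketch, not the repository's loader: the class and function names are hypothetical, the tensor layouts (frames, height, width, channels for video; horizon, dim for actions) are assumptions, and 65 is treated as an upper bound on reference frames because the configuration calls it max_ref_frames.

```python
from dataclasses import dataclass

PROMPT_TEMPLATE = "the robot performs hand gesture {label}"


@dataclass(frozen=True)
class IOContract:
    target_frames: int = 33
    max_ref_frames: int = 65
    height: int = 256
    width: int = 256
    action_horizon: int = 32
    action_dim: int = 20
    proprio_dim: int = 20


def build_prompt(label):
    """Fill the prompt template from the contract with a gesture label."""
    return PROMPT_TEMPLATE.format(label=label)


def check_shapes(contract, ref_video_shape, action_shape, proprio_shape):
    """Return a list of contract violations; an empty list means the shapes fit.

    Assumed layouts (hypothetical): ref_video_shape is
    (frames, height, width, channels), action_shape is (horizon, dim),
    and proprio_shape has the proprio dimension last.
    """
    problems = []
    frames, height, width = ref_video_shape[:3]
    if not 1 <= frames <= contract.max_ref_frames:
        problems.append(f"reference frames {frames} outside 1..{contract.max_ref_frames}")
    if (height, width) != (contract.height, contract.width):
        problems.append(f"resolution {height}x{width} is not {contract.height}x{contract.width}")
    if tuple(action_shape) != (contract.action_horizon, contract.action_dim):
        problems.append(f"action shape {tuple(action_shape)} does not match the contract")
    if proprio_shape[-1] != contract.proprio_dim:
        problems.append(f"proprio dim {proprio_shape[-1]} is not {contract.proprio_dim}")
    return problems
```

For example, `build_prompt("thumbs_up")` fills the template with a made-up label, and `check_shapes(IOContract(), (65, 256, 256, 3), (32, 20), (20,))` returns an empty list.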

Masking Note

This run uses mask_variant=v3 with the sequence layout:

[ref_video | first_frame | gen_video | action]

The run also records bridge_exclude_full_ref=True for provenance. On this OpenWAM/ActionMoT path, the setting that actually distinguishes v2 from v3 masking is the mask_variant value listed above.
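To make the layout concrete, the sketch below turns per-segment token counts into index ranges in the order given above. The token counts are placeholders chosen for illustration, since the real counts depend on the VAE and patchification and are not stated here. The sketch covers only where each segment sits in the sequence; it does not encode the v3 attention rules.

```python
from itertools import accumulate

# Segment order taken from the layout above.
LAYOUT = ("ref_video", "first_frame", "gen_video", "action")


def segment_slices(lengths, layout=LAYOUT):
    """Map each segment name to its slice in the concatenated sequence."""
    ends = list(accumulate(lengths[name] for name in layout))
    starts = [0] + ends[:-1]
    return {name: slice(start, end) for name, start, end in zip(layout, starts, ends)}


# Placeholder token counts, hypothetical and for illustration only.
example_lengths = {"ref_video": 8, "first_frame": 2, "gen_video": 6, "action": 32}

for name, span in segment_slices(example_lengths).items():
    print(f"{name}: tokens [{span.start}, {span.stop})")
```

A mask builder could use such slices to address each block of the attention matrix by name instead of by hard-coded offsets.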
