Wuji Hand Gesture VAM TI2V5B 30L OpenWAM FastWAM F33 R4 H32 Ref65 Refdrop10 Mask V2 Propdrop10
This repository contains one Wuji hand gesture VAM checkpoint from the May 26, 2026 OpenWAM/FastWAM sweep.
Identity
repo_id: knightnemo/wuji-hand-gesture-vam-ti2v5b-30l-openwam-fastwam-f33-r4-h32-ref65-refdrop10-mask-v2-propdrop10
wandb project: wuji_hand_gesture
wandb run id: s4827q9y
wandb run name: wuji_hand_gesture_vam_ti2v5b_30L_mask_v2_openwam_refdrop_0.1_propdrop_0.1_0526_1545
local training dir: /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_f33_r4_h32_ref65_refdrop10_mask_v2_propdrop10
checkpoint: step-10000.safetensors
checkpoint size: 12041754617 bytes
base model: Wan-AI/Wan2.2-TI2V-5B
action expert style: openwam
mask variant: v2
action horizon: 32
proprio dropout: 0.1
reference dropout: 0.1
The checkpoint is a joint model: step-10000.safetensors contains the
fine-tuned video DiT weights plus the action_dit.* action-stream weights.
The Wan2.2-TI2V-5B base model is not included.
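Because both streams live in one file, a loader usually has to separate them by key prefix. The following is a minimal sketch, assuming every action-stream tensor name starts with `action_dit.` (as stated above) and every other key belongs to the video DiT. `split_joint_keys` and `list_checkpoint_keys` are hypothetical helpers, not part of the training code, and the example key names are made up.

```python
from typing import Dict, Iterable, List

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_joint_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Partition checkpoint tensor names into video-DiT and action-stream groups."""
    groups: Dict[str, List[str]] = {"video_dit": [], "action_dit": []}
    for key in keys:
        target = "action_dit" if key.startswith(ACTION_PREFIX) else "video_dit"
        groups[target].append(key)
    return groups


def list_checkpoint_keys(path: str) -> List[str]:
    """Read tensor names without loading the weights.

    Needs the safetensors package; framework="pt" assumes PyTorch is installed.
    """
    from safetensors import safe_open  # lazy import keeps split_joint_keys dependency-free

    with safe_open(path, framework="pt") as f:
        return list(f.keys())


# Illustrative key names only; the real names inside the checkpoint may differ.
example = split_joint_keys(
    ["blocks.0.self_attn.q.weight", "action_dit.blocks.0.ffn.0.weight"]
)
print({name: len(keys) for name, keys in example.items()})
```

With a local copy of the checkpoint, `split_joint_keys(list_checkpoint_keys("step-10000.safetensors"))` would give a quick count of how many tensors fall in each group.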
Files
step-10000.safetensors: final 10k-step checkpoint
model_config.json: compact machine-readable configuration and metrics
training_config.yaml: full W&B training config snapshot
wandb-summary.json: final scalar metrics exported by W&B
training_log_node0.txt: node-0 training log
README.md: this model card
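One way to fetch the checkpoint and confirm the download is complete is sketched below. It assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function; the repo id, file name, and byte count are taken from this card, while `size_matches` and `fetch_checkpoint` are hypothetical helper names.

```python
import os

REPO_ID = (
    "knightnemo/wuji-hand-gesture-vam-ti2v5b-30l-openwam-fastwam-"
    "f33-r4-h32-ref65-refdrop10-mask-v2-propdrop10"
)
CHECKPOINT = "step-10000.safetensors"
EXPECTED_BYTES = 12041754617  # checkpoint size listed under Identity


def size_matches(path: str, expected_bytes: int = EXPECTED_BYTES) -> bool:
    """Return True when the file on disk has the expected size in bytes."""
    return os.path.getsize(path) == expected_bytes


def fetch_checkpoint() -> str:
    """Download the checkpoint from the Hub and return its local cache path."""
    from huggingface_hub import hf_hub_download  # lazy import; needs huggingface_hub

    path = hf_hub_download(repo_id=REPO_ID, filename=CHECKPOINT)
    if not size_matches(path):
        raise RuntimeError(f"unexpected file size for {path}")
    return path
```

Calling `fetch_checkpoint()` downloads roughly 12 GB, so it is left uncalled here. A size check only catches truncated downloads; it is not a substitute for a content hash.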
Final Step Metrics
These are the scalar values in wandb-summary.json at step=10000.
| Metric | Value |
|---|---|
| val/loss | 0.334521 |
| val/loss_action | 0.272516 |
| val/loss_video | 0.0620054 |
| val/action_mse | 0.0499072 |
| val/action_mae | 0.149182 |
| val/video_mse | 98.6488 |
| val/video_psnr | 30.2544 |
| val/video_ssim | 0.961069 |
| val/video_lpips | 0.0246076 |
| train/loss | 0.00823434 |
| train/loss_action | 0.00358793 |
| train/loss_video | 0.00464641 |
| train/grad_norm | 0.953125 |
| _runtime | 15222 |
Best saved checkpoint by aggregate validation loss during the run:
step 4000: loss=0.173965, loss_video=0.056327, loss_action=0.117638
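Only step-10000.safetensors is uploaded here, so treat the step-4000 entry
as training provenance only, unless that checkpoint is uploaded separately.
Note that the final val/loss (0.334521) is roughly twice the step-4000 value
(0.173965), with most of the gap in loss_action (0.272516 vs. 0.117638).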
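The reported totals are consistent with a weighted sum of the two component losses. The sketch below checks that, assuming the aggregate is `lambda_video * loss_video + lambda_action * loss_action` with both weights equal to 1 as listed under Training Configuration. The formula is inferred from the numbers in this card, not taken from the training code.

```python
import math

# Loss weights as listed in the training configuration of this card.
LAMBDA_VIDEO = 1.0
LAMBDA_ACTION = 1.0


def weighted_total(loss_video: float, loss_action: float) -> float:
    """Assumed aggregate: lambda_video * loss_video + lambda_action * loss_action."""
    return LAMBDA_VIDEO * loss_video + LAMBDA_ACTION * loss_action


# (reported total, loss_video, loss_action), copied from this card.
reported = {
    "val @ 10000": (0.334521, 0.0620054, 0.272516),
    "train @ 10000": (0.00823434, 0.00464641, 0.00358793),
    "val @ 4000 (best)": (0.173965, 0.056327, 0.117638),
}

for name, (total, video, action) in reported.items():
    # Tolerance allows for rounding in the exported summary values.
    ok = math.isclose(weighted_total(video, action), total, rel_tol=1e-5)
    print(f"{name}: consistent={ok}")
```

All three entries agree to within rounding, which supports reading val/loss and train/loss as unweighted sums of the video and action terms for this run.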
Training Configuration
| Key | Value |
|---|---|
| dataset_type | wuji_gesture |
| wuji_robot_dataset_root | /cpfs/huangsq/wuji-hand-gestures-cropped |
| variant | clean_50 |
| height | 256 |
| width | 256 |
| num_frames | 33 |
| action_video_freq_ratio | 4 |
| action_horizon | 32 |
| action_dim | 20 |
| proprio_dim | 20 |
| action_format | absolute |
| action_space | joint |
| proprio_space | action |
| action_pad_mode | last |
| action_expert_style | openwam |
| action_mot_backbone_pretrained_path | /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt |
| mask_variant | v2 |
| mask_tail_padding_loss | True |
| full_reference_video | True |
| max_ref_frames | 65 |
| reference_dropout | 0.1 |
| bridge_exclude_full_ref | True |
| extra_inputs | vace_reference_image,action_trajectory |
| target_camera | head_camera |
| reference_camera | head_camera |
| resize_mode | pad |
| backbone | ti2v |
| model_paths | ["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"] |
| tokenizer_path | models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl |
| trainable_models | dit |
| learning_rate | 5e-05 |
| action_lr | None |
| weight_decay | 0.01 |
| warmup_steps | 500 |
| max_steps | 10000 |
| num_epochs | 1 |
| batch_size | 1 |
| gradient_accumulation_steps | 1 |
| dataset_repeat | 1 |
| dataset_num_workers | 8 |
| use_gradient_checkpointing | True |
| save_steps | 1000 |
| val_steps | 200 |
| video_log_steps | 1000 |
| max_val_samples | 20 |
| lambda_video | 1 |
| lambda_action | 1 |
| video_dim | 3072 |
| action_dit_dim | 1024 |
| action_dit_ffn_dim | 4096 |
| action_dit_num_heads | 24 |
| action_dit_num_layers | 30 |
| proprio_dropout | 0.1 |
| window_stride | 1 |
| val_ratio | 0.1 |
| output_path | /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_f33_r4_h32_ref65_refdrop10_mask_v2_propdrop10 |
| wandb_project | wuji_hand_gesture |
| wandb_run_name | wuji_hand_gesture_vam_ti2v5b_30L_mask_v2_openwam_refdrop_0.1_propdrop_0.1_0526_1545 |
Input/Output Contract
Expected inputs:
prompt: "the robot performs hand gesture {label}"
target camera: head_camera
reference camera: head_camera
target video frames: 33
full reference frames: 65
image resolution: 256 x 256
action horizon: 32
action/proprio dim: 20 / 20
Expected outputs:
robot-view target video rollout
20-D absolute robot action targets
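The contract above can be turned into a small pre-flight check before calling an inference pipeline. This is a sketch only: the prompt template and the numbers come from this card, but the axis order of each array, the single-step proprio shape, and the helper names `build_prompt` and `check_shapes` are assumptions made for illustration.

```python
PROMPT_TEMPLATE = "the robot performs hand gesture {label}"

# Sizes from the contract above; the axis order is an assumption.
EXPECTED = {
    "target_video": (33, 256, 256),     # (frames, height, width)
    "reference_video": (65, 256, 256),  # (frames, height, width)
    "actions": (32, 20),                # (action horizon, action dim)
    "proprio": (20,),                   # (proprio dim,), assuming a single step
}


def build_prompt(label: str) -> str:
    """Fill the gesture label into the prompt template used by this run."""
    return PROMPT_TEMPLATE.format(label=label)


def check_shapes(shapes: dict) -> list:
    """Return human-readable mismatches; an empty list means all shapes match."""
    problems = []
    for name, expected in EXPECTED.items():
        actual = tuple(shapes.get(name, ()))
        if actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems


# Hypothetical label and shapes, with one deliberate mismatch in "actions".
print(build_prompt("thumbs_up"))
print(check_shapes({**EXPECTED, "actions": (16, 20)}))
```

Returning a list of problems rather than raising on the first one makes it easier to see every mismatch in a batch at once.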
Masking Note
This run uses mask_variant=v2 with the sequence layout:
[ref_video | first_frame | gen_video | action]
The run also records bridge_exclude_full_ref=True, but only for provenance.
For this OpenWAM/ActionMoT path, the mask_variant value listed above is what
determines whether v2 or v3 masking is in effect.
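To make the layout concrete, the sketch below computes where each segment would start and end in the concatenated token sequence. Only the segment order comes from this card. The token counts are hypothetical, chosen to show the arithmetic, and the attention rules that v2 applies between segments are not described here, so no mask is built.

```python
from typing import Dict, List, Tuple

LAYOUT = ["ref_video", "first_frame", "gen_video", "action"]  # order from this card


def segment_slices(lengths: Dict[str, int]) -> List[Tuple[str, slice]]:
    """Map each segment to its half-open token range in the concatenated sequence."""
    out: List[Tuple[str, slice]] = []
    start = 0
    for name in LAYOUT:
        end = start + lengths[name]
        out.append((name, slice(start, end)))
        start = end
    return out


# Hypothetical token counts; the real counts depend on the tokenization used.
demo = {"ref_video": 8, "first_frame": 2, "gen_video": 6, "action": 32}
for name, span in segment_slices(demo):
    print(f"{name}: [{span.start}, {span.stop})")
```

Keeping the segment boundaries in one place like this is a common way to index into a block attention mask without hard-coding offsets.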