---
base_model: unsloth/Qwen3-4B-Instruct-2507
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/Qwen3-4B-Instruct-2507
- grpo
- lora
- qwen3
- swe-gym
- search-replace-editing
- transformers
- trl
- unsloth
---

# Qwen3-4B SWE-Gym KL02 SFT20K Hard-Multi Best Holdout

This repository contains a PEFT LoRA adapter checkpoint, not a standalone base model. It was produced during the Qwen3 SWE-Gym RL investigation in `/home/datta0/codes/ai/qwen3-grpo-patch`.

## Base Model

- `unsloth/Qwen3-4B-Instruct-2507`

## Checkpoint

Local source checkpoint before upload:

```text
/mnt/disks/unslothai/datta0/cache/qwen3-grpo-patch/20260605_045145_swegym_q4b-kl02-sft20k-hardmulti_10e3a3b/checkpoints/best_holdout
```

This is the current best trainable Qwen3-4B adapter identified in the investigation: KL-GRPO with beta 0.02, combined with hard-multi 20k Coder-30B teacher SFT.

## Evaluation Notes

Held-out SWE-Gym patch-evaluation results recorded locally:

| Evaluation | Greedy | Selected@1 | Pass@8 | Multi-file pass@8 |
|---|---:|---:|---:|---:|
| Seed 9012, 20k context | 9/35 | 10/35 | 16/35 | 4/17 |
| Seed 5678, 20k context | 10/35 | 12/35 | 13/35 | 4/17 |

The robust claim from this checkpoint is the multi-file pass@8 of `4/17`, which replicated across both seeds. The larger overall pass@8 of `16/35` is a measured result, but it should not be treated as fully robust without further re-sampling.

## Intended Use

This checkpoint is for research on SWE-style patch generation using a search/replace edit contract. It should be loaded as a PEFT adapter on top of the base model.

## Limitations

- This is an experimental research adapter.
- The evaluation is on a small held-out slice and is sensitive to sampling and routing choices.
- It is not a general-purpose coding assistant release.
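
To make the sampling sensitivity noted above concrete, the following is a minimal sketch that puts approximate 95% Wilson score intervals around the pass@8 counts from the evaluation table. It is not part of the original evaluation pipeline. The helper name `wilson_interval` and the `RESULTS` dictionary are illustrative, and treating each held-out task as an independent pass/fail trial is an assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Counts copied from the evaluation table; the labels are illustrative.
RESULTS = {
    "seed 9012 pass@8": (16, 35),
    "seed 5678 pass@8": (13, 35),
    "multi-file pass@8 (both seeds)": (4, 17),
}

for label, (passed, total) in RESULTS.items():
    low, high = wilson_interval(passed, total)
    print(f"{label}: {passed}/{total} = {passed / total:.1%}, "
          f"approx. 95% interval [{low:.1%}, {high:.1%}]")
```

With only 35 tasks (17 of them multi-file), the intervals are wide and the two overall pass@8 runs overlap, which is consistent with treating the `16/35` figure cautiously until it is re-sampled.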