---
base_model: unsloth/Qwen3-4B-Instruct-2507
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/Qwen3-4B-Instruct-2507
- grpo
- lora
- qwen3
- swe-gym
- search-replace-editing
- transformers
- trl
- unsloth
---

# Qwen3-4B SWE-Gym KL02 SFT20K Hard-Multi Best Holdout

This repository contains a PEFT LoRA adapter checkpoint, not a standalone base model.
It was produced during the Qwen3 SWE-Gym RL investigation in `/home/datta0/codes/ai/qwen3-grpo-patch`.

## Base Model

- `unsloth/Qwen3-4B-Instruct-2507`
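This repository holds only the LoRA weights, so they are applied on top of this base model. The minimal sketch below uses the standard `transformers` and `peft` loading calls. `adapter_path` is a placeholder for this repository's Hub id or a local copy of the adapter directory. `device_map="auto"` assumes `accelerate` is installed. The imports sit inside the function so the snippet can be imported without pulling in the heavy dependencies.

```python
def load_adapter(adapter_path: str):
    """Load the base model and apply this LoRA adapter with PEFT.

    `adapter_path` is a placeholder: pass this repository's Hub id or a local
    directory containing adapter_config.json and the adapter weights.
    """
    # Lazy imports: only needed when the model is actually loaded.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_id = "unsloth/Qwen3-4B-Instruct-2507"
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype="auto", device_map="auto"
    )
    # Wrap the base model with the LoRA weights from this repository.
    model = PeftModel.from_pretrained(base, adapter_path)
    model.eval()
    return model, tokenizer
```

For example, `model, tokenizer = load_adapter("path/to/this-adapter")`, followed by the usual `tokenizer.apply_chat_template(...)` and `model.generate(...)` calls.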

## Checkpoint

Local path of the source checkpoint before upload:

```text
/mnt/disks/unslothai/datta0/cache/qwen3-grpo-patch/20260605_045145_swegym_q4b-kl02-sft20k-hardmulti_10e3a3b/checkpoints/best_holdout
```

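This is the best trainable Qwen3-4B adapter identified so far in the investigation (the `best_holdout` checkpoint of its run). It was trained with KL-regularized GRPO (KL coefficient beta = 0.02) together with SFT on 20k "hard-multi" examples from a Coder-30B teacher.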
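The sketch below shows the two ingredients that "KL-GRPO beta 0.02" refers to in the common GRPO formulation. The first is group-normalized advantages: each sampled completion's reward is compared with the other samples for the same prompt. The second is a per-token KL penalty against a frozen reference model, weighted by beta. It is a simplified illustration, not the training code used for this checkpoint. Clipping is omitted, it uses plain floats instead of tensors, and it uses the k3 KL estimator common in GRPO implementations. The binary pass/fail reward in the example is an assumption.

```python
import math
from statistics import mean, pstdev

BETA = 0.02  # KL coefficient named in this card ("KL-GRPO beta 0.02")


def group_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean/std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # If all rewards are equal, every advantage is zero (no learning signal).
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_k3(logp, ref_logp):
    """Per-token KL estimate exp(d) - d - 1 with d = ref_logp - logp (>= 0)."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


def token_loss(logp, old_logp, ref_logp, advantage, beta=BETA):
    """Unclipped per-token GRPO loss: -(ratio * A - beta * KL)."""
    ratio = math.exp(logp - old_logp)  # policy ratio vs. the sampling policy
    return -(ratio * advantage - beta * kl_k3(logp, ref_logp))


# Hypothetical group of 4 sampled patches for one task, reward 1 if tests pass.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

A small beta such as 0.02 keeps the penalty weak relative to the advantage term, so the policy can move away from the reference model while still being discouraged from drifting arbitrarily far.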

## Evaluation Notes

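Held-out SWE-Gym patch-evaluation results, recorded locally. Each cell counts the held-out tasks whose patch passed, out of the evaluated total; the multi-file column covers only the 17 multi-file tasks: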

| Evaluation | Greedy | Selected@1 | Pass@8 | Multi-file pass@8 |
|---|---:|---:|---:|---:|
| Seed 9012, 20k context | 9/35 | 10/35 | 16/35 | 4/17 |
| Seed 5678, 20k context | 10/35 | 12/35 | 13/35 | 4/17 |
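The Greedy, Pass@8 and multi-file Pass@8 columns can be read as simple counts over per-task results. The sketch below shows that aggregation under an assumed record layout (`TaskResult` is hypothetical, not the evaluation harness's actual format). Here pass@8 counts a task as solved if any of its 8 sampled patches passes. Selected@1 is left out because the selection rule behind it is not described in this card.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task evaluation record."""
    task_id: str
    multi_file: bool
    greedy_pass: bool
    sample_passes: list[bool]  # one entry per sampled patch (8 in the table)


def summarize(results, k=8):
    """Return table-style 'solved/total' strings for greedy and pass@k."""
    n = len(results)
    greedy = sum(r.greedy_pass for r in results)
    # pass@k here: at least one of the first k samples passed.
    pass_k = sum(any(r.sample_passes[:k]) for r in results)
    multi = [r for r in results if r.multi_file]
    multi_pass_k = sum(any(r.sample_passes[:k]) for r in multi)
    return {
        "greedy": f"{greedy}/{n}",
        f"pass@{k}": f"{pass_k}/{n}",
        f"multi-file pass@{k}": f"{multi_pass_k}/{len(multi)}",
    }
```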

The most robust result from this checkpoint is the multi-file pass@8 of `4/17`, which replicated across both seeds. The higher overall pass@8 of `16/35` (seed 9012) is a measured result, but it dropped to `13/35` under seed 5678, so it should not be treated as robust without further re-sampling.
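With only 35 held-out tasks, a single run's pass rate carries wide uncertainty. The sketch below puts a rough number on that using a Wilson score interval for a binomial proportion. Choosing this interval is our own illustration, not part of the recorded evaluation. It also treats tasks as independent trials, which is an assumption. For the two overall pass@8 runs in the table, the intervals overlap heavily.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for successes out of n trials."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Overall pass@8 counts from the table above.
for label, solved, total in [("seed 9012", 16, 35), ("seed 5678", 13, 35)]:
    lo, hi = wilson_interval(solved, total)
    print(f"{label} pass@8 {solved}/{total}: [{lo:.2f}, {hi:.2f}]")
```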

## Intended Use

This checkpoint is for research on SWE-style patch generation using a search/replace edit contract. It should be loaded as a PEFT adapter on top of the base model.
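The exact search/replace format used in training is not documented in this card. The sketch below assumes an aider-style block: a file path line, then `<<<<<<< SEARCH`, the old text, `=======`, the new text, and `>>>>>>> REPLACE`. It illustrates the usual contract that each search snippet must match the target file exactly once. `BLOCK_RE`, `parse_blocks` and `apply_blocks` are hypothetical names, and the files are kept in memory for simplicity.

```python
import re

# Assumed aider-style block format; the markers used in training may differ.
BLOCK_RE = re.compile(
    r"^(?P<path>\S[^\n]*)\n<<<<<<< SEARCH\n(?P<search>.*?)\n?=======\n"
    r"(?P<replace>.*?)\n?>>>>>>> REPLACE",
    re.DOTALL | re.MULTILINE,
)


def parse_blocks(text):
    """Extract (path, search, replace) triples from model output."""
    return [(m["path"].strip(), m["search"], m["replace"]) for m in BLOCK_RE.finditer(text)]


def apply_blocks(files: dict[str, str], text: str) -> dict[str, str]:
    """Apply every block to an in-memory {path: content} mapping.

    Each search snippet must occur exactly once in its file; otherwise the
    edit is rejected rather than guessed.
    """
    updated = dict(files)
    for path, search, replace in parse_blocks(text):
        if path not in updated:
            raise KeyError(f"unknown file: {path}")
        count = updated[path].count(search)
        if count != 1:
            raise ValueError(f"{path}: search text matched {count} times, expected 1")
        updated[path] = updated[path].replace(search, replace, 1)
    return updated
```

Rejecting ambiguous or missing matches, instead of applying a fuzzy match, keeps a malformed edit from silently producing a wrong patch. One output can carry blocks for several paths, which is how a multi-file edit is expressed under this assumed format.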

## Limitations

- This is an experimental research adapter.
- The evaluation is on a small held-out slice and is sensitive to sampling and routing choices.
- It is not a general-purpose coding assistant release.