Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC
FP8 dynamic quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, produced for inference on NVIDIA DGX Spark (GB10, SM121) under its 128 GB unified-memory budget.
This is the clean abliteration variant quantized to FP8, distinct from existing Claude-distilled variants on the Hub. Uses llm-compressor FP8_DYNAMIC scheme + compressed-tensors float-quantized format.
⚠️ Vision pipeline fix (2026-05-03)
Uploads of this checkpoint from before 2026-05-03 had broken vision input: the model would emit !!!!! token loops on any image. Root cause: vLLM (≤0.20.0) expects vision tensor keys at visual.*, but Qwen3_5MoeForConditionalGeneration saves them under model.language_model.visual.*, so all 333 vision-tower tensors were silently skipped during load.
This is now fixed: shard 19 and model.safetensors.index.json were re-uploaded with the visual prefix stripped. Text-only output is unaffected.
If you cloned the model before 2026-05-03, the cleanest path is huggingface_hub.snapshot_download(..., force_download=True, allow_patterns=["model-00019-of-00019.safetensors", "model.safetensors.index.json"]), which fetches only ~600 MB of changed weights.
Diagnosis + remap script: see remap_visual_path.py in this repo.
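A minimal sketch of that selective re-download. The repo id is taken from the launch command in this card. The helper names (fixed_file_kwargs, refresh_fixed_files) are hypothetical and only illustrate how the two arguments from the note fit together.

```python
# Sketch: re-pull only the two files changed by the 2026-05-03 fix.
REPO_ID = "coolthor/Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC"

# The two files named in the fix note.
FIXED_FILES = [
    "model-00019-of-00019.safetensors",
    "model.safetensors.index.json",
]


def fixed_file_kwargs(repo_id: str = REPO_ID) -> dict:
    """Build the keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "allow_patterns": FIXED_FILES,  # restrict the download to these files
        "force_download": True,         # bypass stale cached copies
    }


def refresh_fixed_files(repo_id: str = REPO_ID) -> str:
    """Re-download the fixed files and return the local snapshot path."""
    # Imported lazily so fixed_file_kwargs works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**fixed_file_kwargs(repo_id))
```

Calling refresh_fixed_files() once should be enough for a default cache layout; the other 18 shards are left as they are.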
Quick stats
| Metric | Value |
|---|---|
| Base | huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated (BF16, 67 GB) |
| Format | compressed-tensors float-quantized (FP8 e4m3fn weights, dynamic per-token activation) |
| Disk size | ~36 GB (1.86× shrink from BF16) |
| Quant scheme | FP8_DYNAMIC via llm-compressor 0.10 |
| Quantized layers | 31,030 Linear modules → FP8 e4m3fn |
| Kept BF16 | lm_head, visual.*, mlp.gate (router), shared_expert_gate |
| MTP weights | Preserved (model_mtp.safetensors) for speculative decoding |
| Modality | Text + Image (vision tower preserved BF16, see fix note above) |
Performance on DGX Spark (GB10, vLLM)
Single-stream decode, 200 tokens, GB10 (1× SM121, 128 GB UMA), vllm-node-tf5 build:
| Configuration | tok/s | vs BF16 |
|---|---|---|
| BF16 abliterated (this base) | 30.71 | 1.00× |
| This FP8 + MTP speculative | 51.72 | 1.68× |
Speculative decoding (qwen3_next_mtp, num_speculative_tokens=2) is the dominant contribution.
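For reference, the ratio in the table and the corresponding average per-token latency can be reproduced from the two throughput figures. This is plain arithmetic on the numbers above, not a new measurement; the function names are illustrative.

```python
# Throughput figures copied from the table above (tokens per second).
BF16_TOK_S = 30.71
FP8_MTP_TOK_S = 51.72


def speedup(baseline_tok_s: float, new_tok_s: float) -> float:
    """Relative throughput of the new configuration over the baseline."""
    return new_tok_s / baseline_tok_s


def ms_per_token(tok_s: float) -> float:
    """Average wall-clock milliseconds per generated token."""
    return 1000.0 / tok_s


print(f"speedup:   {speedup(BF16_TOK_S, FP8_MTP_TOK_S):.2f}x")   # 1.68x
print(f"BF16:      {ms_per_token(BF16_TOK_S):.1f} ms/token")     # 32.6
print(f"FP8 + MTP: {ms_per_token(FP8_MTP_TOK_S):.1f} ms/token")  # 19.3
```

With speculative decoding the per-token figure is an average, since one verification step can accept more than one token.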
Recommended vLLM launch (GB10)
vllm serve coolthor/Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Notes:
- Do not add --kv-cache-dtype fp8 on GB10: FP8 KV cache has known accuracy/repetition issues on SM121.
- Vision encoder is functional after the 2026-05-03 fix (drop --language-model-only if you want to feed images).
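If you launch the server from a script, a sketch like the following builds the same argument list as the command above and serializes the speculative config with json.dumps, which avoids shell-quoting mistakes. build_vllm_command is a hypothetical helper; it uses only the flags shown in this card.

```python
import json
import shlex

MODEL = "coolthor/Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC"


def build_vllm_command(
    model: str = MODEL,
    max_model_len: int = 32768,
    gpu_memory_utilization: float = 0.90,
    num_speculative_tokens: int = 2,
) -> list[str]:
    """Return the argv list for the recommended GB10 launch."""
    spec = {
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        "--reasoning-parser", "qwen3",
        # Compact JSON, passed as a single argv entry (no shell quoting needed).
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_coder",
        # Deliberately no --kv-cache-dtype fp8 (see the note on SM121).
    ]


cmd = build_vllm_command()
print(shlex.join(cmd))  # shell-safe rendering for logs or copy-paste
```

To actually start the server, pass the list to subprocess.run(cmd, check=True).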
Multi-GPU users
For datacenter Blackwell (B200, RTX PRO 6000 with tensor-parallel-size > 1), follow the official Qwen 3.6 FP8 model card recipe and replace the model path. This artifact's FP8 layout is standard compressed-tensors, so it loads on any vLLM build that supports the architecture.
Quantization recipe
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
MODEL_PATH = "huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated"
SAVE_DIR = "Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC"
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    MODEL_PATH, dtype="auto", low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
# Quantize every Linear module to FP8 except the ones listed in `ignore`,
# which stay in BF16: output head, vision tower, MoE router, shared-expert gate.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*shared_expert_gate$",
    ],
)
# FP8_DYNAMIC is data-free, so no calibration dataset is passed.
oneshot(model=model, recipe=recipe)
# Small shards keep peak memory low during save on 128 GB UMA.
model.save_pretrained(SAVE_DIR, max_shard_size="2GB", safe_serialization=True)
processor.save_pretrained(SAVE_DIR)
# Copy the MTP weights from the base checkpoint so speculative decoding still works.
save_mtp_tensors_to_checkpoint(source_model=MODEL_PATH, dest_dir=SAVE_DIR)
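To see which modules the ignore list keeps in BF16, the patterns can be tried against a few module names. This is a sketch under two assumptions: that entries prefixed with re: are applied with re.match semantics (anchored at the start of the name), and that the sample module names below, which are hypothetical, resemble the real ones. is_ignored is an illustrative helper, not part of llm-compressor.

```python
import re

IGNORE = [
    "re:.*lm_head",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*shared_expert_gate$",
]


def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    """Return True if any ignore entry matches the module name."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: regex entries are anchored at the start (re.match).
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
SAMPLES = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
]
for name in SAMPLES:
    label = "BF16" if is_ignored(name) else "FP8"
    print(f"{label:<4}  {name}")
```

The trailing $ on the two gate patterns is what keeps expert projections such as gate_proj out of the ignore list, so they are still quantized.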
Three gotchas you'll hit
1. Qwen3_5MoeForConditionalGeneration writes a triple language_model. prefix in keys that vLLM's loader can't match (looks for model.language_model.). Strip one extra language_model.language_model. substring from each key in the saved safetensors before serving. (This is the LM-body fix; the visual tower needs an additional strip, see #2.)
2. Vision tower keys need an additional prefix strip. Qwen3_5MoeForConditionalGeneration writes vision weights as model.language_model.visual.*, but vLLM's qwen3_5.py loader looks for visual.* (no prefix). All 333 vision-tower tensors will silently skip-load and any image input produces !!!!! token loops, with text-only output unaffected. Strip the model.language_model. prefix from every key matching model.language_model.visual.* in both the relevant safetensors shard and model.safetensors.index.json. The full remap script is at remap_visual_path.py in this repo (~50 lines, ~2 sec runtime; only one shard contains visual tensors, the other 18 can be hard-linked unchanged).
3. 128 GB UMA is tight for save_pretrained: max_shard_size="2GB" is required; the default 50 GB shard buffer + GPU pool overlap will OOM-kill on Spark.
The full debugging journey (4 OOMs across 4 attempts + the vision-prefix fix) is documented in the accompanying blog post: Self-quantizing 35B abliterated MoE to FP8 on DGX Spark.
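The key renaming behind the vision-tower gotcha can be sketched on the index file alone. This is not the repo's remap_visual_path.py, only an illustration of the prefix strip it is described as doing. The function names and the tensor names in the toy index are hypothetical.

```python
import json

OLD_PREFIX = "model.language_model.visual."
NEW_PREFIX = "visual."


def remap_key(key: str) -> str:
    """Strip the model.language_model. prefix from vision-tower keys only."""
    if key.startswith(OLD_PREFIX):
        return NEW_PREFIX + key[len(OLD_PREFIX):]
    return key


def remap_index(index: dict) -> dict:
    """Return a copy of an index.json dict with vision keys renamed."""
    remapped = dict(index)
    remapped["weight_map"] = {
        remap_key(key): shard for key, shard in index["weight_map"].items()
    }
    return remapped


# Toy index with hypothetical tensor names, for illustration only.
toy_index = {
    "metadata": {},
    "weight_map": {
        "model.language_model.visual.blocks.0.attn.qkv.weight": "model-00019-of-00019.safetensors",
        "model.language_model.layers.0.mlp.gate.weight": "model-00001-of-00019.safetensors",
    },
}
print(json.dumps(remap_index(toy_index)["weight_map"], indent=2))
```

The tensors inside the affected shard need the same renaming as the index entries; otherwise the index points at names the shard does not contain.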
Verifying the vision fix
After the 2026-05-03 re-upload, vision input should work end-to-end. Quick smoke test:
import base64, json, urllib.request
# Encode the test image as base64 for an inline data URL.
with open("test.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
payload = {
    # Must match the name vLLM serves the model under
    # (--served-model-name, or the model path if that flag is unset).
    "model": "qwen36-abliterated",
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
    "max_tokens": 200,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"])
A working response returns a coherent description; a !!!!! loop means you're still hitting the prefix issue (your local copy is pre-fix — re-pull shard 19 + index.json).
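For an automated check, the failure signature can be detected with a simple heuristic on the returned text. The threshold of five consecutive exclamation marks is an assumption based on the !!!!! pattern described here, and looks_like_bang_loop is a hypothetical helper name.

```python
def looks_like_bang_loop(text: str, min_run: int = 5) -> bool:
    """Heuristic: True if the reply contains a run of at least min_run '!' characters."""
    return "!" * min_run in text


print(looks_like_bang_loop("A cat sitting on a windowsill."))  # False
print(looks_like_bang_loop("!!!!!!!!!!!!!!!!!!!!"))            # True
```

A False result only rules out this specific failure; it does not confirm that the description is accurate.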
Abliteration sanity
Smoke-tested with prompts vanilla Qwen typically softens with disclaimers (villain monologue, technical security topics, dark humor, persona requests). All five test prompts returned direct content with zero refusal markers. Identity-level "are you uncensored?" returns the standard Qwen self-description (training-data residue), but functional refusal behavior is suppressed.
License
Inherits from base model. The original abliteration is Apache-2.0 (Huihui's remove-refusals-with-transformers derivative). Qwen 3.6 base is Apache-2.0.
⚠ Safety: This model has reduced refusal behavior. Use responsibly.
Credits
- Base: huihui-ai (abliteration)
- Source model: Qwen Team / Alibaba
- Quantization: this checkpoint, by coolthor
- Tooling: llm-compressor, compressed-tensors, vLLM
Changelog
- 2026-05-03: Vision pipeline fix. Re-uploaded model-00019-of-00019.safetensors and model.safetensors.index.json with visual.* prefix corrected; added remap_visual_path.py and updated docs. No change to text-only behavior or benchmarks. Earlier downloads silently skipped 333 vision-tower tensors on load.
- 2026-04-28: Initial v7-fixed FP8 dynamic upload. 51.72 tok/s with MTP speculative.
☕ If this saved you GPU hours, you can buy me a coffee.