Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC

FP8 dynamic quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, produced for inference on NVIDIA DGX Spark (GB10, SM121) under its 128 GB unified-memory budget.

This is the clean abliteration variant quantized to FP8, distinct from the existing Claude-distilled variants on the Hub. It uses the llm-compressor FP8_DYNAMIC scheme and the compressed-tensors float-quantized format.

⚠️ Vision pipeline fix (2026-05-03)

Uploads of this checkpoint made before 2026-05-03 had broken vision input: the model would emit !!!!! token loops on any image. Root cause: vLLM (≤0.20.0) expects vision tensor keys at visual.*, but Qwen3_5MoeForConditionalGeneration saves them under model.language_model.visual.*, so all 333 vision-tower tensors were silently skipped during load.

This is now fixed: shard 19 (model-00019-of-00019.safetensors) and model.safetensors.index.json were re-uploaded with the model.language_model. prefix stripped from the visual.* keys. Text-only output is unaffected.

If you cloned the model before 2026-05-03, the cleanest path is huggingface_hub.snapshot_download(..., force_download=True, allow_patterns=["model-00019-of-00019.safetensors", "model.safetensors.index.json"]) — only ~600 MB of changed weights.
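
A minimal sketch of that partial re-download, assuming huggingface_hub is installed. The helper name refresh_fixed_files and the optional local_dir argument are illustrative and not part of this repo; the two file names and the snapshot_download parameters come from the note above.

```python
REPO_ID = "coolthor/Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC"

# The only two files that changed in the 2026-05-03 re-upload.
FIXED_FILES = [
    "model-00019-of-00019.safetensors",
    "model.safetensors.index.json",
]


def refresh_fixed_files(repo_id=REPO_ID, local_dir=None):
    """Force a re-download of just the two changed files; returns the snapshot path."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=FIXED_FILES,  # skip the unchanged shards
        force_download=True,         # bypass the stale cached copies
        local_dir=local_dir,         # None keeps the default HF cache layout
    )
```

Calling refresh_fixed_files() with no arguments should refresh the default Hugging Face cache; pass local_dir if you originally downloaded into a plain directory.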

Diagnosis + remap script: see remap_visual_path.py in this repo.
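
To check whether a local copy still carries the old key layout, something like the following could be run against model.safetensors.index.json. This is a rough diagnostic sketch, not the contents of remap_visual_path.py; the function names are hypothetical, and it only inspects key names in the index's weight_map.

```python
import json

BROKEN_PREFIX = "model.language_model.visual."  # pre-fix layout
FIXED_PREFIX = "visual."                        # layout vLLM expects


def count_visual_keys(weight_map):
    """Count vision-tower keys under the broken and the fixed prefix."""
    broken = sum(1 for key in weight_map if key.startswith(BROKEN_PREFIX))
    fixed = sum(1 for key in weight_map if key.startswith(FIXED_PREFIX))
    return {"broken": broken, "fixed": fixed}


def check_index(index_path):
    """Load a safetensors index file and report its visual-key layout."""
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    return count_visual_keys(index["weight_map"])
```

A non-zero "broken" count would suggest the copy predates the fix; per the note above, a fixed copy should list its vision-tower tensors (333 of them) under visual.*.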

Quick stats

| Metric | Value |
|---|---|
| Base | huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated (BF16, 67 GB) |
| Format | compressed-tensors float-quantized (FP8 e4m3fn weights, dynamic per-token activation) |
| Disk size | ~36 GB (1.86× shrink from BF16) |
| Quant scheme | FP8_DYNAMIC via llm-compressor 0.10 |
| Quantized layers | 31,030 Linear modules → FP8 e4m3fn |
| Kept BF16 | lm_head, visual.*, mlp.gate (router), shared_expert_gate |
| MTP weights | Preserved (model_mtp.safetensors) for speculative decoding |
| Modality | Text + Image (vision tower preserved BF16, see fix note above) |
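
As a rough illustration of what "dynamic per-token activation" means, the sketch below computes one scale per token (row) at runtime so that the scaled values fit the e4m3fn range, whose largest finite value is 448. It is a pure-Python toy under stated assumptions: the function names are made up, the rounding onto the FP8 grid is omitted, and this is not how llm-compressor or vLLM implement the kernels.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3fn


def per_token_scales(activations):
    """Return one scale per row (token): row max magnitude divided by 448."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero below.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_and_clamp(activations):
    """Scale each row into [-448, 448]; a real kernel would then cast to FP8."""
    scales = per_token_scales(activations)
    scaled = [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
        for row, s in zip(activations, scales)
    ]
    return scaled, scales
```

Because the scales are derived from each incoming token, no calibration dataset is needed for activations, which is consistent with the recipe below calling oneshot without one.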

Performance on DGX Spark (GB10, vLLM)

Single-stream decode, 200 tokens, GB10 (1× SM121, 128 GB UMA), vllm-node-tf5 build:

| Configuration | tok/s | vs BF16 |
|---|---|---|
| BF16 abliterated (this base) | 30.71 | 1.00× |
| This FP8 + MTP speculative | 51.72 | 1.68× |

Speculative decoding (qwen3_next_mtp, num_speculative_tokens=2) is the dominant contribution to the speedup.
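
The ratios quoted in this card can be reproduced from the raw numbers with a few lines of arithmetic. The variable names are illustrative; the inputs are the throughput and size figures reported above.

```python
bf16_tok_s = 30.71     # BF16 baseline, single-stream decode
fp8_mtp_tok_s = 51.72  # FP8 + MTP speculative decoding

speedup = fp8_mtp_tok_s / bf16_tok_s
print(f"speedup vs BF16: {speedup:.2f}x")  # 1.68x

# Wall-clock time for the 200-token benchmark at each throughput.
tokens = 200
print(f"BF16:      {tokens / bf16_tok_s:.2f} s")
print(f"FP8 + MTP: {tokens / fp8_mtp_tok_s:.2f} s")

# Disk footprint: 67 GB BF16 vs ~36 GB FP8.
disk_shrink = 67 / 36
print(f"disk shrink: {disk_shrink:.2f}x")  # 1.86x
```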

Recommended vLLM launch (GB10)

vllm serve coolthor/Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Notes:

  • Do not add --kv-cache-dtype fp8 on GB10 — FP8 KV cache has known accuracy/repetition issues on SM121.
  • Vision encoder is functional after the 2026-05-03 fix (drop --language-model-only if you want to feed images).
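
If you launch the server from a script, one way to avoid shell-quoting mistakes in the --speculative-config JSON is to build the argument list in Python. This is a sketch only: build_vllm_command is a hypothetical helper, and every flag and value is copied from the launch command above.

```python
import json
import shlex

MODEL = "coolthor/Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC"


def build_vllm_command(model=MODEL, max_model_len=32768,
                       gpu_memory_utilization=0.90, num_speculative_tokens=2):
    """Return the recommended GB10 launch command as an argv list."""
    spec = {
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--reasoning-parser", "qwen3",
        # json.dumps guarantees valid JSON regardless of shell quoting.
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_coder",
        # Deliberately no --kv-cache-dtype fp8, per the note above.
    ]


# Print a copy-pasteable, correctly quoted shell command.
print(shlex.join(build_vllm_command()))
```

The returned list can also be passed directly to subprocess.run, which sidesteps the shell entirely.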

Multi-GPU users

For datacenter Blackwell (B200, RTX PRO 6000 with tensor-parallel-size > 1), follow the official Qwen 3.6 FP8 model card recipe and replace the model path. This artifact's FP8 layout is standard compressed-tensors, so it loads on any vLLM build that supports the architecture.

Quantization recipe

from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_PATH = "huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated"
SAVE_DIR = "Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC"

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    MODEL_PATH, dtype="auto", low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*shared_expert_gate$",
    ],
)

oneshot(model=model, recipe=recipe)
model.save_pretrained(SAVE_DIR, max_shard_size="2GB", safe_serialization=True)
processor.save_pretrained(SAVE_DIR)
save_mtp_tensors_to_checkpoint(source_model=MODEL_PATH, dest_dir=SAVE_DIR)
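
To see which modules the ignore list above would keep in BF16, the patterns can be exercised against a few module names. This is a sketch under two assumptions: that "re:" entries are matched from the start of the module name (re.match semantics), and that the example module names below resemble the real ones; both are illustrative, and the matching logic inside compressed-tensors may differ in detail.

```python
import re

IGNORE = [
    "re:.*lm_head",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*shared_expert_gate$",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if any ignore entry matches the module name."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: regex entries are anchored at the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
]:
    print(name, "->", "BF16 (ignored)" if is_ignored(name) else "FP8")
```

The trailing $ on the router patterns is what keeps the expert gate_proj layers from being skipped along with mlp.gate.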

Three gotchas you'll hit

  1. Qwen3_5MoeForConditionalGeneration writes keys with a tripled language_model. prefix, which vLLM's loader cannot match (it looks for model.language_model.). Strip one extra language_model.language_model. substring from each key in the saved safetensors before serving. (This is the LM-body fix; the visual tower needs an additional strip, see #2.)

  2. Vision tower keys need an additional prefix strip. Qwen3_5MoeForConditionalGeneration writes vision weights as model.language_model.visual.*, but vLLM's qwen3_5.py loader looks for visual.* (no prefix). Without the fix, all 333 vision-tower tensors are silently skipped at load time and any image input produces !!!!! token loops, while text-only output is unaffected. Strip the model.language_model. prefix from every key matching model.language_model.visual.* in both the relevant safetensors shard and model.safetensors.index.json. The full remap script is at remap_visual_path.py in this repo (~50 lines, ~2 sec runtime; only one shard contains visual tensors, so the other 18 can be hard-linked unchanged).

  3. 128 GB UMA is tight for save_pretrained. max_shard_size="2GB" is required; the default 50 GB shard buffer plus GPU pool overlap will get the process OOM-killed on Spark.
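
The two key rewrites from gotchas 1 and 2 can be sketched as a pure key-renaming pass over the index's weight_map. This is an illustration and not the repo's remap_visual_path.py: the function names are hypothetical, the exact shape of the tripled prefix is assumed from the description above, and the tensor names stored inside the affected safetensors shard need the same renaming, which is not shown here.

```python
import json

LM_PREFIX = "model.language_model."


def remap_key(key):
    """Apply both prefix fixes to a single tensor key."""
    # Gotcha 1: drop one redundant "language_model.language_model." substring.
    key = key.replace("language_model.language_model.", "", 1)
    # Gotcha 2: vision-tower keys must start at "visual." with no prefix.
    if key.startswith(LM_PREFIX + "visual."):
        key = key[len(LM_PREFIX):]
    return key


def remap_weight_map(weight_map):
    """Return a new weight_map with remapped keys; fail loudly on collisions."""
    remapped = {}
    for old_key, shard in weight_map.items():
        new_key = remap_key(old_key)
        if new_key in remapped:
            raise ValueError(f"key collision after remap: {new_key}")
        remapped[new_key] = shard
    return remapped


def rewrite_index(index_path):
    """Rewrite model.safetensors.index.json in place with remapped keys."""
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    index["weight_map"] = remap_weight_map(index["weight_map"])
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2)
```

Keys that match neither pattern pass through unchanged, so the pass is safe to run over the whole index; back up the original file before rewriting it in place.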

The full debugging journey (4 OOMs across 4 attempts + the vision-prefix fix) is documented in the accompanying blog post: Self-quantizing 35B abliterated MoE to FP8 on DGX Spark.

Verifying the vision fix

After the 2026-05-03 re-upload, vision input should work end-to-end. Quick smoke test:

import base64, json, urllib.request

with open("test.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

payload = {
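    # Must match the name vLLM serves the model under; with the launch command
    # above (no --served-model-name), that is the repo id.
    "model": "coolthor/Huihui-Qwen3.6-35B-A3B-abliterated-FP8-DYNAMIC",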
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
    "max_tokens": 200,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"])

A working response returns a coherent description; a !!!!! loop means you're still hitting the prefix issue (your local copy is pre-fix — re-pull shard 19 + index.json).
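
For an automated check, the failure mode can be flagged with a simple heuristic on the returned text. This is only a sketch: looks_like_bang_loop and its min_run threshold are hypothetical, and a legitimate reply with a long run of exclamation marks would trigger a false positive.

```python
import re


def looks_like_bang_loop(text, min_run=5):
    """Heuristic: True when the reply contains min_run or more consecutive '!'."""
    return re.search(r"!{%d,}" % min_run, text) is not None


# Example: classify two replies.
print(looks_like_bang_loop("!!!!!!!!!!!!!!!!"))                      # True
print(looks_like_bang_loop("A red bicycle leaning against a wall."))  # False
```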

Abliteration sanity

Smoke-tested with prompts vanilla Qwen typically softens with disclaimers (villain monologue, technical security topics, dark humor, persona requests). All five test prompts returned direct content with zero refusal markers. Identity-level "are you uncensored?" returns the standard Qwen self-description (training-data residue), but functional refusal behavior is suppressed.

License

Inherits from base model. The original abliteration is Apache-2.0 (Huihui's remove-refusals-with-transformers derivative). Qwen 3.6 base is Apache-2.0.

Safety: This model has reduced refusal behavior. Use responsibly.

Credits

Changelog

  • 2026-05-03: Vision pipeline fix. Re-uploaded model-00019-of-00019.safetensors and model.safetensors.index.json with visual.* prefix corrected; added remap_visual_path.py and updated docs. No change to text-only behavior or benchmarks. Earlier downloads silently skipped 333 vision-tower tensors on load.
  • 2026-04-28: Initial v7-fixed FP8 dynamic upload. 51.72 tok/s with MTP speculative.

☕ If this saved you GPU hours, you can buy me a coffee.
