Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4
Overview
4-bit NVFP4 quantization of OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — the Kimi-K2.6-distilled, reasoning-DPO-healed, abliterated/uncensored evolution of Qwen/Qwen3.5-122B-A10B (Mixture of Experts, ~10B active / 122B total).
This build packs the transformer weights to NVFP4 with LLM Compressor, cutting the on-disk footprint from ~250 GB to ≈82 GB while keeping the vision tower, MTP head, router gates, and the Gated-DeltaNet attention path in higher precision. It is multimodal (image + text), uncensored, and — despite 4-bit weights — beats the full-precision Qwen3.5-122B-A10B baseline on every benchmark we ran (see Evaluation).
It loads anywhere compressed-tensors is supported and is auto-detected by vLLM (no --quantization flag needed).
Evaluation
Scores below were measured on this NVFP4 build and compared against the full-precision (BF16) Qwen/Qwen3.5-122B-A10B baseline:
| Benchmark | Qwen3.5-122B-A10B (BF16, baseline) | Qwopus3.5 NVFP4 (this model) |
|---|---|---|
| CTI | 64.8 | 71.5 |
| LiveCodeBench | 78.9 | 79.9 |
| BFCL | 72.2 | 85.6 |
Even after 4-bit (NVFP4) weight quantization, this model outperforms the BF16 Qwen3.5-122B-A10B baseline on all three benchmarks — the Kimi-K2.6 distillation + reasoning-DPO healing more than offsets any quantization loss. BFCL is the Berkeley Function-Calling Leaderboard (tool use); LiveCodeBench is contamination-controlled code generation.
Quantization (NVFP4)
Produced with LLM Compressor using the QuantizationModifier recipe shipped in this repo (recipe.yaml).
- Scheme: `NVFP4` (format: `nvfp4-pack-quantized`), i.e. 4-bit float weights in micro-blocks of 16, each block carrying an FP8 (`float8_e4m3fn`) scale. Weights are static; input activations are quantized dynamically (per-group, static-minmax).
- Quantized: all transformer `Linear` layers, i.e. attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors).
- Left in higher precision (BF16): the vision tower (`visual.*`, 333 tensors), the MTP head (`model_mtp.safetensors`, 785 tensors), `lm_head`, token embeddings, the MoE router gates (`mlp.gate`, `shared_expert_gate`), and the Gated-DeltaNet linear-attention path (`linear_attn.*`).
- Architecture preserved: `Qwen3_5MoeForConditionalGeneration` / `model_type: qwen3_5_moe`, so the checkpoint loads as a drop-in replacement for the base at the architecture level.
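To make the block-scaled scheme concrete, here is a minimal pure-Python sketch of quantizing one micro-block of 16 weights onto the 4-bit E2M1 grid with a single shared scale. It is an illustration only, not the LLM Compressor implementation: the function names are hypothetical, the scale is kept as an ordinary Python float rather than rounded to `float8_e4m3fn`, and any extra scaling the real kernels apply is ignored.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign bit is separate).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # micro-block size named in the scheme above


def quantize_block(block):
    """Quantize one block of BLOCK floats; returns (codes, scale).

    Assumption: the scale stays a Python float here, whereas the real
    format stores it as FP8 (float8_e4m3fn).
    """
    assert len(block) == BLOCK
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return codes, scale


def dequantize_block(codes, scale):
    """Reconstruct approximate weights from grid values and the block scale."""
    return [c * scale for c in codes]


weights = [i / 10 - 0.8 for i in range(BLOCK)]  # toy values in [-0.8, 0.7]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the widest gap on the grid is between 4 and 6, the worst-case reconstruction error in this sketch is one scale unit per element, which is why a small block size (and therefore a tight per-block scale) matters.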
Downloads / Other Formats
| Format | Repo | Use it for |
|---|---|---|
| Full BF16 weights | Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated | Transformers / vLLM, fine-tuning, requantizing |
| NVFP4 (this repo) | Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 | vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included) |
| GGUF (Q4_K_M) | …-Kimi-K2.6-destill-healed-abliterated-GGUF | llama.cpp / LM Studio (text-only). MTP head included. |
| MLX 4-bit | …-Kimi-K2.6-destill-healed-abliterated-MLX-4bit | Apple Silicon / LM Studio (vision supported) |
Files
| File | Description | Size |
|---|---|---|
| model-00001-of-00002.safetensors | NVFP4-packed language weights (4-bit + FP8 scales) + lm_head | ~50.0 GB |
| model-00002-of-00002.safetensors | NVFP4-packed language weights (tail) + BF16 vision tower | ~26.4 GB |
| model_mtp.safetensors | BF16 MTP head (785 tensors, 1 hidden layer) | ~5.0 GB |
| model.safetensors.index.json | Combined weight map | n/a |
| config.json | Multimodal config incl. quantization_config (nvfp4-pack-quantized) | n/a |
| recipe.yaml | LLM Compressor quantization recipe | n/a |
| tokenizer*, chat_template.jinja, generation_config.json, *preprocessor_config.json | Standard | n/a |
Total on disk: ≈81.5 GB (~76 GiB).
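The total can be cross-checked with a few lines of arithmetic. The sketch below sums the three weight files listed in the Files table, converts decimal GB to GiB, and computes the nominal storage cost of an NVFP4 weight (a 4-bit value plus one 8-bit scale shared by 16 weights). The helper names are illustrative, and the small JSON and tokenizer files are not counted, which is why the sum lands slightly under the quoted total.

```python
GB = 10**9   # decimal gigabyte
GIB = 2**30  # binary gibibyte

# Approximate sizes in GB, copied from the Files table.
FILE_SIZES_GB = {
    "model-00001-of-00002.safetensors": 50.0,
    "model-00002-of-00002.safetensors": 26.4,
    "model_mtp.safetensors": 5.0,
}


def total_gb(sizes):
    """Sum the listed file sizes (GB)."""
    return sum(sizes.values())


def gb_to_gib(gb):
    """Convert decimal gigabytes to binary gibibytes."""
    return gb * GB / GIB


def nvfp4_bits_per_weight(block=16, value_bits=4, scale_bits=8):
    """Nominal bits per weight: the value plus its share of the block scale."""
    return value_bits + scale_bits / block


print(f"listed weight files: {total_gb(FILE_SIZES_GB):.1f} GB")
print(f"81.5 GB in GiB:      {gb_to_gib(81.5):.1f} GiB")
print(f"bits per weight:     {nvfp4_bits_per_weight()}")
```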
Usage (vLLM)
vLLM auto-detects the NVFP4 compressed-tensors format — no --quantization flag.
vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-model-len 262144
The checkpoint ships the MTP head, so you can enable 1-token speculative decoding:
--speculative-config '{"num_speculative_tokens":1}'
Tip (Qwen3.5 MoE / Gated-DeltaNet): if `torch.compile` errors in the GDN path during startup, add `--compilation-config '{"use_inductor_graph_partition":true}'`.
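If you launch the server from a script, the flags above can be assembled programmatically so the JSON arguments are always well-formed. This is a sketch under stated assumptions: `build_serve_command` and its parameters are hypothetical helper names, and only flags that appear in this section are used.

```python
import json
import shlex

MODEL_ID = "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4"


def build_serve_command(use_mtp=False, gdn_compile_workaround=False,
                        max_model_len=262144):
    """Return the `vllm serve` argv list using the flags from this section."""
    cmd = [
        "vllm", "serve", MODEL_ID,
        "--tool-call-parser", "qwen3_coder",
        "--reasoning-parser", "qwen3",
        "--max-model-len", str(max_model_len),
    ]
    if use_mtp:
        # 1-token speculative decoding through the shipped MTP head.
        cmd += ["--speculative-config",
                json.dumps({"num_speculative_tokens": 1})]
    if gdn_compile_workaround:
        # Workaround for torch.compile errors in the Gated-DeltaNet path.
        cmd += ["--compilation-config",
                json.dumps({"use_inductor_graph_partition": True})]
    return cmd


# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(build_serve_command(use_mtp=True)))
```

Passing the returned list to `subprocess.run` starts the server; building the JSON with `json.dumps` avoids shell-quoting mistakes in the config strings.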
Text + vision both work through AutoProcessor / AutoModelForImageTextToText (via the compressed-tensors integration) for non-vLLM workflows.
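A minimal sketch of that non-vLLM path follows the standard Transformers image-text-to-text pattern. It assumes `transformers` (with compressed-tensors support) is installed and that the machine has enough memory for the checkpoint; the helper names are illustrative, and device placement and dtype arguments are deliberately left at their defaults.

```python
MODEL_ID = "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4"


def build_messages(image_url, question):
    """Build a single-turn chat with one image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


def generate_answer(image_url, question, max_new_tokens=40):
    """Load the model and answer one question about one image."""
    # Imported lazily so build_messages works without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(MODEL_ID)
    inputs = processor.apply_chat_template(
        build_messages(image_url, question),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    return processor.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```

Calling `generate_answer(url, "What animal is on the candy?")` downloads the full checkpoint on first use, so expect a long initial load.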
Vision & MTP
Both the vision tower and the MTP (multi-token-prediction) head are included and kept in BF16.
- Vision works as expected (image / video → text).
- MTP: the head is present and shape-compatible. It enables speculative decoding under vLLM, but on the upstream checkpoint it produced little measurable speedup or quality gain and would benefit from retraining. It is shipped intact for completeness and forward compatibility.
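To exercise the vision path through a running vLLM server, an image can be sent over the OpenAI-compatible chat endpoint. The sketch below uses only the standard library; it assumes the server listens on vLLM's default `http://localhost:8000`, and `build_vision_request` and `ask` are illustrative names rather than part of any client library.

```python
import json
import urllib.request

MODEL_ID = "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4"


def build_vision_request(image_url, prompt, base_url="http://localhost:8000"):
    """Build a POST request carrying one text part and one image part."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(image_url, prompt):
    """Send the request and return the assistant's text reply."""
    with urllib.request.urlopen(build_vision_request(image_url, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`ask(...)` only works while the server is up; `build_vision_request` can be inspected offline to confirm the payload shape.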
Hardware
The NVFP4 weights are ≈82 GB (vs ~250 GB for the BF16 release), so the model runs on a single accelerator with ≥ 96 GB: H200, B200, RTX PRO 6000 Blackwell, or a 128 GB unified-memory NVIDIA DGX Spark / GB10. Native FP4 math requires a Blackwell GPU (compute capability ≥ 10.0, i.e. sm_100 or newer); on other hardware, such as the Hopper-based H200, vLLM runs NVFP4 via FlashInfer/emulation.
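A quick pre-flight check can tell you which path a given device will take. The sketch below encodes the two rules from this section: the compute-capability threshold of 10.0 for native FP4, and the memory left over after loading ≈82 GB of weights. The function names are hypothetical; on a real machine the capability tuple can be read with `torch.cuda.get_device_capability()`.

```python
WEIGHTS_GB = 82.0  # approximate NVFP4 footprint quoted in this section


def has_native_fp4(capability):
    """True if a (major, minor) compute capability meets the 10.0 threshold.

    Example input: the tuple returned by torch.cuda.get_device_capability().
    """
    return tuple(capability) >= (10, 0)


def headroom_gb(device_memory_gb, weights_gb=WEIGHTS_GB):
    """Memory left for KV cache and activations after loading the weights."""
    return device_memory_gb - weights_gb


# Compute capabilities for a Hopper part (9.0) and Blackwell parts (10.0, 12.0).
for name, cap, mem in [("H200", (9, 0), 141), ("B200", (10, 0), 192),
                       ("RTX PRO 6000 Blackwell", (12, 0), 96)]:
    path = "native FP4" if has_native_fp4(cap) else "fallback path"
    print(f"{name}: {path}, {headroom_gb(mem):.0f} GB headroom")
```

Headroom matters in practice because the KV cache for long contexts has to fit in whatever the weights leave free, so a 96 GB device may need a smaller `--max-model-len` than larger ones.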
Support & Community
- Discord: https://discord.gg/rhUZY5GEZr
- Bitcoin Donations: bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv
Notes
- License: MIT (inherits from the upstream Qwen3.5 base license terms)
- Base Model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated → Qwen/Qwen3.5-122B-A10B
- Quantization: NVFP4 (`nvfp4-pack-quantized`, group size 16) via LLM Compressor
- Modality: Text + Vision (image / video) + MTP
- Architecture: Qwen3.5 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MTP head
Thanks
- Jackrong — for the idea of Qwopus merges (Opus distillations on Qwen models).
- wangzhang — for the wonderful abliterix framework, which was customized to do this abliteration.
- The LLM Compressor and vLLM teams for the NVFP4 tooling.
Disclaimer
Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.