Instructions to use OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4", device_map="auto")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4")
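# device_map="auto" (needs the accelerate package) spreads the ~82 GB of weights over the available GPUs
model = AutoModelForImageTextToText.from_pretrained("OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4", device_map="auto")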
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

SGLang

How to use OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

Overview

4-bit NVFP4 quantization of OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — the Kimi-K2.6-distilled, reasoning-DPO-healed, abliterated/uncensored evolution of Qwen/Qwen3.5-122B-A10B (Mixture of Experts, ~10B active / 122B total).

This build packs the transformer weights to NVFP4 with LLM Compressor, cutting the on-disk footprint from ~250 GB to ≈82 GB while keeping the vision tower, MTP head, lm_head, token embeddings, MoE router gates, and the Gated-DeltaNet linear-attention path in higher precision (BF16). It is multimodal (image + text) and uncensored, and despite its 4-bit weights it beats the full-precision (BF16) Qwen3.5-122B-A10B baseline on all three benchmarks we ran (see Evaluation).

It loads anywhere compressed-tensors is supported and is auto-detected by vLLM (no --quantization flag needed).

Evaluation

Scores below were measured on this NVFP4 build and compared against the full-precision (BF16) Qwen/Qwen3.5-122B-A10B baseline:

Benchmark	Qwen3.5-122B-A10B (BF16, baseline)	Qwopus3.5 NVFP4 (this model)
CTI	64.8	71.5
LiveCodeBench	78.9	79.9
BFCL	72.2	85.6

Even with 4-bit (NVFP4) weights, this model outperforms the BF16 Qwen3.5-122B-A10B baseline on all three benchmarks, which suggests that the gains from Kimi-K2.6 distillation and reasoning-DPO healing more than offset any quantization loss. BFCL is the Berkeley Function-Calling Leaderboard (tool use); LiveCodeBench is a contamination-controlled code-generation benchmark.
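
To put the margins in the table above in perspective, here is a small Python sketch that turns the three score pairs into absolute and relative gains. It only redoes the arithmetic on the published numbers and says nothing about run-to-run variance.

```python
# Scores copied from the Evaluation table (BF16 baseline, this NVFP4 build).
SCORES = {
    "CTI": (64.8, 71.5),
    "LiveCodeBench": (78.9, 79.9),
    "BFCL": (72.2, 85.6),
}


def deltas(scores):
    """Return {benchmark: (absolute gain in points, relative gain in %)}."""
    out = {}
    for name, (baseline, nvfp4) in scores.items():
        gain = nvfp4 - baseline
        out[name] = (round(gain, 1), round(100.0 * gain / baseline, 1))
    return out


for name, (abs_gain, rel_gain) in deltas(SCORES).items():
    print(f"{name:14s} +{abs_gain:.1f} pts  (+{rel_gain:.1f}% relative)")
```

The largest relative gain is on BFCL (tool use), the smallest on LiveCodeBench.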

Quantization (NVFP4)

Produced with LLM Compressor using the QuantizationModifier recipe shipped in this repo (recipe.yaml).

Scheme: NVFP4 (format: nvfp4-pack-quantized). Weights are 4-bit floats (E2M1) in micro-blocks of 16, each block carrying an FP8 (float8_e4m3fn) scale, plus a per-tensor FP32 global scale. Weights are quantized statically; input activations are quantized dynamically at runtime (per-group, static-minmax observer).
Quantized: the language model's Linear layers, including the full-attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors in total); the exceptions are listed below.
Left in higher precision (BF16): the vision tower (visual.* — 333 tensors), the MTP head (model_mtp.safetensors — 785 tensors), lm_head, token embeddings, the MoE router gates (mlp.gate, shared_expert_gate), and the Gated-DeltaNet linear-attention path (linear_attn.*).
Architecture preserved: Qwen3_5MoeForConditionalGeneration / model_type: qwen3_5_moe, so the checkpoint loads as a drop-in replacement for the base at the architecture level.
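
To make the block format concrete, the sketch below fake-quantizes weights in 16-element blocks onto the FP4 (E2M1) grid. It is a simplified illustration, not LLM Compressor's implementation: the block scale stays a Python float instead of being rounded to FP8 E4M3, the per-tensor global scale is omitted, and ties round toward the smaller magnitude. The storage arithmetic is exact, though: 16 four-bit values plus one 8-bit scale cost 4.5 bits per weight.

```python
import math

# Non-negative magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # NVFP4 micro-block size
BITS_PER_WEIGHT = (BLOCK * 4 + 8) / BLOCK  # 16 x 4-bit values + one 8-bit scale = 4.5


def quantize_block(values):
    """Fake-quantize one block; return (scale, signed FP4 codes, dequantized values)."""
    amax = max(abs(v) for v in values)
    # Map the block's largest magnitude onto the largest FP4 value (6.0).
    # Simplification: the real format rounds this scale to FP8 E4M3.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        # Nearest grid point; min() keeps the first (smaller) one on ties.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    dequant = [c * scale for c in codes]
    return scale, codes, dequant


def fake_quantize(weights):
    """Quantize a flat list of weights block by block (length must be a multiple of 16)."""
    if len(weights) % BLOCK:
        raise ValueError("length must be a multiple of the block size")
    out = []
    for i in range(0, len(weights), BLOCK):
        _, _, deq = quantize_block(weights[i:i + BLOCK])
        out.extend(deq)
    return out
```

Because the widest gap on the E2M1 grid is between 4 and 6, the per-weight reconstruction error in this sketch is at most one block scale.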

Downloads / Other Formats

Format	Repo	Use it for
Full BF16 weights	Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated	Transformers / vLLM, fine-tuning, requantizing
NVFP4 (this repo)	Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4	vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included)
GGUF (Q4_K_M)	…-Kimi-K2.6-destill-healed-abliterated-GGUF	llama.cpp / LM Studio (text-only). MTP head included.
MLX 4-bit	…-Kimi-K2.6-destill-healed-abliterated-MLX-4bit	Apple Silicon / LM Studio (vision supported)

Files

File	Description	Size
`model-00001-of-00002.safetensors`	NVFP4-packed language weights (4-bit + FP8 scales) + `lm_head`	~50.0 GB
`model-00002-of-00002.safetensors`	NVFP4-packed language weights (tail) + BF16 vision tower	~26.4 GB
`model_mtp.safetensors`	BF16 MTP head (785 tensors, 1 hidden layer)	~5.0 GB
`model.safetensors.index.json`	Combined weight map	—
`config.json`	Multimodal config incl. `quantization_config` (`nvfp4-pack-quantized`)	—
`recipe.yaml`	LLM Compressor quantization recipe	—
`tokenizer`, `chat_template.jinja`, `generation_config.json`, `preprocessor_config.json`	Standard	—

Total on disk: ≈81.5 GB (~76 GiB).
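
To check the tensor counts and sizes above against a local download, the safetensors header can be read with the standard library alone: each file starts with an 8-byte little-endian header length followed by a JSON map from tensor name to dtype, shape, and byte offsets. The helper names below are hypothetical; the GB/GiB conversion shows where the two total figures come from.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(path):
    """Count tensors per dtype and total payload bytes for one shard."""
    header = read_safetensors_header(path)
    dtypes = Counter(info["dtype"] for info in header.values())
    payload = sum(end - begin for begin, end in (i["data_offsets"] for i in header.values()))
    return {"tensors": len(header), "dtypes": dict(dtypes), "bytes": payload}


def gb_and_gib(num_bytes):
    """Decimal gigabytes vs binary gibibytes."""
    return num_bytes / 1e9, num_bytes / 2**30
```

For example, summarize("model_mtp.safetensors") should report 785 tensors if the file matches the table, and gb_and_gib(81.5e9) returns roughly (81.5, 75.9), i.e. the "≈81.5 GB (~76 GiB)" above.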

Usage (vLLM)

vLLM auto-detects the NVFP4 compressed-tensors format — no --quantization flag.

vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-model-len 262144
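
For scripted access, the curl requests shown earlier translate into a stdlib-only Python client along these lines. This is a minimal sketch against the OpenAI-compatible chat endpoint; the helper names and the max_tokens default are illustrative choices, and the same payload should also work against the SGLang server on port 30000.

```python
import json
import urllib.request

MODEL = "OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4"


def build_chat_payload(prompt, image_url=None, max_tokens=512):
    """Build an OpenAI-style /v1/chat/completions body (text, optionally plus one image)."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def post_chat(payload, base_url="http://localhost:8000/v1", timeout=600):
    """POST the payload to a running OpenAI-compatible server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Example: post_chat(build_chat_payload("Describe this image in one sentence.", image_url="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")) returns a response whose ["choices"][0]["message"]["content"] holds the answer.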

The checkpoint ships the MTP head, so you can enable 1-token speculative decoding by adding this flag to the vllm serve command:

  --speculative-config '{"num_speculative_tokens":1}'

Tip (Qwen3.5 MoE / Gated-DeltaNet): if torch.compile errors in the GDN path during startup, add --compilation-config '{"use_inductor_graph_partition":true}'.

Text + vision both work through AutoProcessor / AutoModelForImageTextToText (via the compressed-tensors integration) for non-vLLM workflows.

Vision & MTP

Both the vision tower and the MTP (multi-token-prediction) head are included and kept in BF16.

Vision works as expected (image / video → text).
MTP: the head is present and shape-compatible, and it enables speculative decoding under vLLM. On the upstream checkpoint, however, it produced little measurable speedup or quality gain and would benefit from retraining; it is shipped intact for completeness and forward compatibility.
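
Why a 1-token MTP head can yield little speedup follows from the usual back-of-the-envelope model of speculative decoding (Leviathan et al., 2023): if each drafted token is accepted independently with probability alpha, one target forward pass emits (1 - alpha^(k+1)) / (1 - alpha) tokens on average, which is just 1 + alpha for k = 1. The sketch below evaluates that model; the independent-acceptance assumption and the draft_cost default are simplifications, not measurements of this checkpoint.

```python
def expected_tokens_per_step(alpha, k=1):
    """Expected tokens emitted per target forward pass with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return k + 1.0  # every draft accepted, plus the target's own token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha, k=1, draft_cost=0.1):
    """Idealized walltime speedup.

    draft_cost is the cost of one draft (MTP) step relative to one full target
    step; 0.1 is a hypothetical value, not a measurement.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1.0)
```

With alpha = 0.5 and a draft step costing 10% of a target step, this model predicts about 1.36x; at low acceptance rates the draft overhead can cancel the gain entirely.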

Hardware

The NVFP4 weights take ≈82 GB (vs ~250 GB for the BF16 release), so the model runs on a single accelerator with ≥ 96 GB of memory: H200, B200, RTX PRO 6000 Blackwell, or a 128 GB unified-memory NVIDIA DGX Spark (GB10). Native FP4 math requires a Blackwell GPU (compute capability 10.0 or newer, e.g. sm_100 on B200 and sm_120 on RTX PRO 6000 Blackwell); on other hardware, such as the Hopper-based H200, vLLM runs NVFP4 via FlashInfer/emulation.
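
A quick pre-flight check for the two conditions in the Hardware paragraph (enough memory, native FP4 support) can be written as two small functions. This is a sketch: the 82 GB figure comes from this card, and real headroom also depends on KV-cache size, --max-model-len, and framework overhead.

```python
# Approximate NVFP4 checkpoint size from this card, in GB.
WEIGHTS_GB = 82.0


def native_fp4(major, minor):
    """True for Blackwell-class GPUs (compute capability 10.0 or newer)."""
    return (major, minor) >= (10, 0)


def kv_headroom_gb(device_gb, weights_gb=WEIGHTS_GB):
    """Memory left for KV cache, activations and CUDA graphs after loading weights."""
    return device_gb - weights_gb
```

On a live machine, PyTorch's torch.cuda.get_device_capability() returns the (major, minor) pair, and torch.cuda.get_device_properties(0).total_memory gives device memory in bytes; a 96 GB card leaves roughly 14 GB for everything besides the weights.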

Support & Community

Discord: https://discord.gg/rhUZY5GEZr
Bitcoin Donations: bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv

Notes

License: MIT (inherits from the upstream Qwen3.5 base license terms)
Base Model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated → Qwen/Qwen3.5-122B-A10B
Quantization: NVFP4 (nvfp4-pack-quantized, group size 16) via LLM Compressor
Modality: Text + Vision (image / video) + MTP
Architecture: Qwen3.5 MoE (qwen3_5_moe; ~10B active / 122B total) + Qwen3-VL vision tower + MTP head

Thanks

Jackrong — for the idea of Qwopus merges (Opus distillations on Qwen models).
wangzhang — for the wonderful abliterix framework, which was customized to do this abliteration.
The LLM Compressor and vLLM teams for the NVFP4 tooling.

Disclaimer

Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.

Safetensors model size (as reported by the Hub): 74B params. Tensor types: F32, BF16, F8_E4M3.

Model tree: Qwen/Qwen3.5-122B-A10B (base model) → OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated (finetuned) → this model (one of 7 quantized variants).