Instructions to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF

SGLang

How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF
```

Qwen3.6-35B-A3B-NVFP4-MTP (GGUF)

NVFP4-quantized, multimodal GGUF build of Qwen3.6-35B-A3B for llama.cpp on NVIDIA DGX Spark (GB10, SM121).

Role in the .init Stack

This is the fast, large-context, vision-capable model for the .init AI engineering platform. It's exposed as the .INIT/Flash alias in LiteLLM and is the go-to model for long-context tasks and multimodal reasoning on DGX Spark.

Property	Value
`.INIT/` alias	`.INIT/Flash`
Docker profile	`llama-qwen-3-6-35b-a3b`
Host port	`8001`
Use case	Large-context analysis, vision tasks, fast inference
Context	256K tokens (entire repositories + images)
Vision	✅ mmproj BF16 GGUF for multimodal reasoning

Why This Model for DGX Spark

MoE architecture — 35B total, only 3.5B active params per token → fast inference (~30 tok/s) despite large total size
256K context — YaRN scaling from 128K native base, enough for full repository scans
Vision support — separate mmproj GGUF enables code screenshot analysis, UI understanding, diagram reasoning
~22 GB + 863 MB mmproj → ~80 GB in GPU memory, fits in 128 GB with room for OS and other services
NVFP4 quantization — works on SM121 without server-class TMEM hardware required by vLLM

Optimal Settings for the .init Stack

These are the settings used in docker-compose.interface.yml:

llama-server \
  -m /models/model.gguf \
  -mm /models/mmproj.gguf \
  -a qwen3.6_35b_a3b \
  --jinja --chat-template-file /workspace/chat_template.jinja \
  --reasoning on --reasoning-format deepseek --reasoning-budget 8192 \
  --min-p 0.05 \
  --spec-type draft-mtp,ngram-mod --spec-draft-n-max 2 \
  --spec-draft-p-min 0.88 --spec-draft-ngl 99 \
  -ctk f16 -ctv f16 -ngl all -fa on -sm none -fit off \
  -c 262144 --rope-scaling yarn --yarn-orig-ctx 131072 \
  -b 2048 -ub 512 \
  --parallel 1 --cont-batching --cache-prompt --swa-full \
  -t 8 -tb 8 --mlock \
  --port 8080 --host 0.0.0.0 --metrics --timeout 120

When to Use in .init

Task	Use .INIT/Flash (35B A3B)	Use .INIT/Pro (27B)
Quick code generation	✅ Fast	Better quality
Full repo scan (50K-256K tokens)	✅ 256K context	❌ 128K limit
UI screenshot analysis	✅ Vision mmproj	❌ Text-only
Diagram / chart reasoning	✅ Vision mmproj	❌ Text-only
Deep debugging / architecture	Good	✅ Better reasoning
Chain-of-thought tasks	Good	✅ Deeper reasoning

Why This Exists

Inspired by sakamakismile/Qwen3.6-35B-A3B-NVFP4 — NVFP4 quantization + MTP restoration recipe
Goal: build a GGUF-compatible NVFP4 image for use with llama.cpp on DGX Spark, including the multimodal projector (mmproj) for vision-language tasks

What Changed

Quantized to NVFP4 via nvidia-modelopt (group_size=16)
MTP head preserved in BF16 for speculative decoding
Multimodal projector (mmproj) exported separately as BF16 GGUF for vision tasks
Calibrated on neuralmagic/calibration (20 samples)

Why llama.cpp + ModelOpt NVFP4 (and not vLLM)

The vLLM Problem on DGX Spark

Deploying NVFP4 via vLLM on DGX Spark (SM120/SM121) hits multiple blockers:

Missing TMEM hardware — DGX Spark's edge-tier Blackwell chip lacks the 256 KB Tensor Memory (TMEM) found in datacenter SM100. NVFP4's block-packed layout cannot take the hardware-accelerated fast path, wasting VRAM and losing throughput vs. AWQ/FP8.
Unoptimized kernels — vLLM's NVFP4 kernels for SM120 fall back to slow software paths, underperforming established 4-bit formats.
MTP shape mismatch — Qwen3.6's speculative decoding head fails to inherit NVFP4's quantization layout in vLLM, causing batch initialization errors or 0% acceptance rates.
Illegal instruction crashes — vLLM may invoke server-class Cutlass/FlashInfer backends incompatible with Spark's edge silicon.
Driver lock — vLLM's latest NVFP4 support requires driver 595.58+, but DGX Spark ships with 580.x. Forcing an upgrade can break the unified memory fabric.

Why This Stack Works

Component	Why
llama.cpp	No TMEM dependency — GGUF loads weights directly into GPU memory without layout transformations that require server-class hardware
ModelOpt NVFP4	NVIDIA's own quantizer produces compact weights (~24 GB for 35B A3B) with native BF16 MTP preservation
MTP + n-gram	Dual speculative decoding path achieves high throughput on DGX Spark without vLLM's MTP bugs
Multimodal	Separate BF16 mmproj GGUF enables vision-language tasks without re-quantizing the projector

File	Size	Description
`qwen3.6-35b-a3b-nvfp4-mtp.gguf`	~22 GB	Main model weights (NVFP4 quantized)
`mmproj-BF16.gguf`	~863 MB	Multimodal projector (BF16, for vision tasks)

Performance

Condition	Throughput	Notes
DGX Spark, short prompts	~30 tok/s	MTP n=2 + ngram speculative decoding, model fully on GPU
DGX Spark, long context (256K)	~18–28 tok/s	YaRN RoPE scaling, KV cache grows with context

30 tok/s achieved when:

Single DGX Spark node (GB10, SM121, 128 GB unified memory)
Speculative decoding enabled (--spec-type draft-mtp,ngram-mod, --spec-draft-n-max 2)
Model fully resident on GPU (-ngl all, --mlock)
Short-to-medium context (< 8K tokens in prompt)
8 threads, flash attention on, F16 KV cache

Optimal Settings for Long Context

llama-server \
  -m qwen3.6-35b-a3b-nvfp4-mtp.gguf \
  -mm mmproj-BF16.gguf \
  -a qwen3.6_35b_a3b \
  --jinja \
  --chat-template-file /workspace/chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --reasoning-budget 8192 \
  --min-p 0.05 \
  --spec-type draft-mtp,ngram-mod \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.88 \
  --spec-draft-ngl 99 \
  -ctk f16 -ctv f16 \
  -ngl all \
  -fa on \
  -sm none \
  -fit off \
  -c 262144 \
  --rope-scaling yarn \
  --yarn-orig-ctx 131072 \
  -b 2048 \
  -ub 512 \
  --parallel 1 \
  --cont-batching \
  --cache-prompt \
  --swa-full \
  -t 8 -tb 8 \
  --mlock \
  --port 8080 \
  --host 0.0.0.0 \
  --metrics \
  --timeout 120

Key Settings Explained

Flag	Value	Why
`-m`	`qwen3.6-35b-a3b-nvfp4-mtp.gguf`	Main NVFP4-quantized model weights
`-mm`	`mmproj-BF16.gguf`	Multimodal projector for vision-language tasks
`-c`	`262144`	256K context window for long documents
`--rope-scaling yarn` + `--yarn-orig-ctx 131072`	YaRN extrapolation from 128K base	Preserves quality beyond native context
`--spec-type draft-mtp,ngram-mod`	MTP + n-gram hybrid	Dual speculative path for higher acceptance rate
`--spec-draft-n-max 2`	2 draft tokens	Conservative speculation for larger model
`--spec-draft-p-min 0.88`	88% acceptance threshold	Balanced speculation, fallback to n-gram
`--reasoning-budget 8192`	8192 tokens	Extended reasoning budget for complex tasks
`-ngl all`	All layers on GPU	No CPU offloading — DGX Spark has 128 GB
`-fa on`	Flash attention	O(n) memory for long context
`-ctk f16` / `-ctv f16`	F16 KV cache	Precision-critical for 256K context
`-b 2048` / `-ub 512`	Prefill 2048, decode 512	Balanced batch sizing for throughput
`--parallel 1`	1 concurrent sequence	Larger model — single sequence avoids memory pressure
`--cont-batching`	Continuous batching	Better GPU utilization under load
`--swa-full`	Full sliding window attention	Better long-range attention quality
`--mlock`	Lock in RAM	Prevents eviction during long generations

When to Use Long Context Settings

Codebase analysis: scanning entire repositories (50K–256K tokens)
Document reasoning: legal/technical documents with cross-reference needs
Extended conversations: multi-turn sessions accumulating context
Multimodal reasoning: vision-language tasks with long textual context
Not needed for: chat, quick Q&A, or prompts < 8K tokens (use simpler settings for max throughput)

Usage

# Quick start (text-only)
llama-server -m qwen3.6-35b-a3b-nvfp4-mtp.gguf --port 8081

# With multimodal projector (vision-language)
llama-server -m qwen3.6-35b-a3b-nvfp4-mtp.gguf -mm mmproj-BF16.gguf --port 8081

Differences from Qwen3.6-27B-Text

Aspect	27B Text	35B A3B
Parameters	27B	35B + A3B MoE
Quantization	NVFP4	NVFP4
Multimodal	No (text-only)	Yes (mmproj BF16)
Context window	128K	256K
Reasoning budget	8192	8192
Speculative draft max	2	2
Parallel sequences	1	1
Model size	~14 GB	~22 GB + 863 MB mmproj
Host port	8000	8001

License

Apache 2.0

Downloads last month: 1,009

GGUF

Hardware compatibility

4-bit

Model tree for nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Quantized

(435)

this model

nilayparikh
/

Qwen3.6-35B-A3B-NVFP4-MTP-GGUF