Instructions to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF
- SGLang
How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF with Docker Model Runner:
docker model run hf.co/nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF
Use Docker images
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \
--host 0.0.0.0 \
--port 30000# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'Qwen3.6-35B-A3B-NVFP4-MTP (GGUF)
NVFP4-quantized, multimodal GGUF build of Qwen3.6-35B-A3B for llama.cpp on NVIDIA DGX Spark (GB10, SM121).
Role in the .init Stack
This is the fast, large-context, vision-capable model for the .init AI engineering platform. It's exposed as the .INIT/Flash alias in LiteLLM and is the go-to model for long-context tasks and multimodal reasoning on DGX Spark.
| Property | Value |
|---|---|
.INIT/ alias |
.INIT/Flash |
| Docker profile | llama-qwen-3-6-35b-a3b |
| Host port | 8001 |
| Use case | Large-context analysis, vision tasks, fast inference |
| Context | 256K tokens (entire repositories + images) |
| Vision | β mmproj BF16 GGUF for multimodal reasoning |
Why This Model for DGX Spark
- MoE architecture β 35B total, only 3.5B active params per token β fast inference (~30 tok/s) despite large total size
- 256K context β YaRN scaling from 128K native base, enough for full repository scans
- Vision support β separate mmproj GGUF enables code screenshot analysis, UI understanding, diagram reasoning
- ~22 GB + 863 MB mmproj β ~80 GB in GPU memory, fits in 128 GB with room for OS and other services
- NVFP4 quantization β works on SM121 without server-class TMEM hardware required by vLLM
Optimal Settings for the .init Stack
These are the settings used in docker-compose.interface.yml:
llama-server \
-m /models/model.gguf \
-mm /models/mmproj.gguf \
-a qwen3.6_35b_a3b \
--jinja --chat-template-file /workspace/chat_template.jinja \
--reasoning on --reasoning-format deepseek --reasoning-budget 8192 \
--min-p 0.05 \
--spec-type draft-mtp,ngram-mod --spec-draft-n-max 2 \
--spec-draft-p-min 0.88 --spec-draft-ngl 99 \
-ctk f16 -ctv f16 -ngl all -fa on -sm none -fit off \
-c 262144 --rope-scaling yarn --yarn-orig-ctx 131072 \
-b 2048 -ub 512 \
--parallel 1 --cont-batching --cache-prompt --swa-full \
-t 8 -tb 8 --mlock \
--port 8080 --host 0.0.0.0 --metrics --timeout 120
When to Use in .init
| Task | Use .INIT/Flash (35B A3B) | Use .INIT/Pro (27B) |
|---|---|---|
| Quick code generation | β Fast | Better quality |
| Full repo scan (50K-256K tokens) | β 256K context | β 128K limit |
| UI screenshot analysis | β Vision mmproj | β Text-only |
| Diagram / chart reasoning | β Vision mmproj | β Text-only |
| Deep debugging / architecture | Good | β Better reasoning |
| Chain-of-thought tasks | Good | β Deeper reasoning |
Why This Exists
- Inspired by
sakamakismile/Qwen3.6-35B-A3B-NVFP4β NVFP4 quantization + MTP restoration recipe - Goal: build a GGUF-compatible NVFP4 image for use with llama.cpp on DGX Spark, including the multimodal projector (mmproj) for vision-language tasks
What Changed
- Quantized to NVFP4 via
nvidia-modelopt(group_size=16) - MTP head preserved in BF16 for speculative decoding
- Multimodal projector (mmproj) exported separately as BF16 GGUF for vision tasks
- Calibrated on
neuralmagic/calibration(20 samples)
Why llama.cpp + ModelOpt NVFP4 (and not vLLM)
The vLLM Problem on DGX Spark
Deploying NVFP4 via vLLM on DGX Spark (SM120/SM121) hits multiple blockers:
- Missing TMEM hardware β DGX Spark's edge-tier Blackwell chip lacks the 256 KB Tensor Memory (TMEM) found in datacenter SM100. NVFP4's block-packed layout cannot take the hardware-accelerated fast path, wasting VRAM and losing throughput vs. AWQ/FP8.
- Unoptimized kernels β vLLM's NVFP4 kernels for SM120 fall back to slow software paths, underperforming established 4-bit formats.
- MTP shape mismatch β Qwen3.6's speculative decoding head fails to inherit NVFP4's quantization layout in vLLM, causing batch initialization errors or 0% acceptance rates.
- Illegal instruction crashes β vLLM may invoke server-class Cutlass/FlashInfer backends incompatible with Spark's edge silicon.
- Driver lock β vLLM's latest NVFP4 support requires driver 595.58+, but DGX Spark ships with 580.x. Forcing an upgrade can break the unified memory fabric.
Why This Stack Works
| Component | Why |
|---|---|
| llama.cpp | No TMEM dependency β GGUF loads weights directly into GPU memory without layout transformations that require server-class hardware |
| ModelOpt NVFP4 | NVIDIA's own quantizer produces compact weights (~24 GB for 35B A3B) with native BF16 MTP preservation |
| MTP + n-gram | Dual speculative decoding path achieves high throughput on DGX Spark without vLLM's MTP bugs |
| Multimodal | Separate BF16 mmproj GGUF enables vision-language tasks without re-quantizing the projector |
Contents
| File | Size | Description |
|---|---|---|
qwen3.6-35b-a3b-nvfp4-mtp.gguf |
~22 GB | Main model weights (NVFP4 quantized) |
mmproj-BF16.gguf |
~863 MB | Multimodal projector (BF16, for vision tasks) |
Performance
| Condition | Throughput | Notes |
|---|---|---|
| DGX Spark, short prompts | ~30 tok/s | MTP n=2 + ngram speculative decoding, model fully on GPU |
| DGX Spark, long context (256K) | ~18β28 tok/s | YaRN RoPE scaling, KV cache grows with context |
30 tok/s achieved when:
- Single DGX Spark node (GB10, SM121, 128 GB unified memory)
- Speculative decoding enabled (
--spec-type draft-mtp,ngram-mod,--spec-draft-n-max 2) - Model fully resident on GPU (
-ngl all,--mlock) - Short-to-medium context (< 8K tokens in prompt)
- 8 threads, flash attention on, F16 KV cache
Optimal Settings for Long Context
llama-server \
-m qwen3.6-35b-a3b-nvfp4-mtp.gguf \
-mm mmproj-BF16.gguf \
-a qwen3.6_35b_a3b \
--jinja \
--chat-template-file /workspace/chat_template.jinja \
--reasoning on \
--reasoning-format deepseek \
--reasoning-budget 8192 \
--min-p 0.05 \
--spec-type draft-mtp,ngram-mod \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.88 \
--spec-draft-ngl 99 \
-ctk f16 -ctv f16 \
-ngl all \
-fa on \
-sm none \
-fit off \
-c 262144 \
--rope-scaling yarn \
--yarn-orig-ctx 131072 \
-b 2048 \
-ub 512 \
--parallel 1 \
--cont-batching \
--cache-prompt \
--swa-full \
-t 8 -tb 8 \
--mlock \
--port 8080 \
--host 0.0.0.0 \
--metrics \
--timeout 120
Key Settings Explained
| Flag | Value | Why |
|---|---|---|
-m |
qwen3.6-35b-a3b-nvfp4-mtp.gguf |
Main NVFP4-quantized model weights |
-mm |
mmproj-BF16.gguf |
Multimodal projector for vision-language tasks |
-c |
262144 |
256K context window for long documents |
--rope-scaling yarn + --yarn-orig-ctx 131072 |
YaRN extrapolation from 128K base | Preserves quality beyond native context |
--spec-type draft-mtp,ngram-mod |
MTP + n-gram hybrid | Dual speculative path for higher acceptance rate |
--spec-draft-n-max 2 |
2 draft tokens | Conservative speculation for larger model |
--spec-draft-p-min 0.88 |
88% acceptance threshold | Balanced speculation, fallback to n-gram |
--reasoning-budget 8192 |
8192 tokens | Extended reasoning budget for complex tasks |
-ngl all |
All layers on GPU | No CPU offloading β DGX Spark has 128 GB |
-fa on |
Flash attention | O(n) memory for long context |
-ctk f16 / -ctv f16 |
F16 KV cache | Precision-critical for 256K context |
-b 2048 / -ub 512 |
Prefill 2048, decode 512 | Balanced batch sizing for throughput |
--parallel 1 |
1 concurrent sequence | Larger model β single sequence avoids memory pressure |
--cont-batching |
Continuous batching | Better GPU utilization under load |
--swa-full |
Full sliding window attention | Better long-range attention quality |
--mlock |
Lock in RAM | Prevents eviction during long generations |
When to Use Long Context Settings
- Codebase analysis: scanning entire repositories (50Kβ256K tokens)
- Document reasoning: legal/technical documents with cross-reference needs
- Extended conversations: multi-turn sessions accumulating context
- Multimodal reasoning: vision-language tasks with long textual context
- Not needed for: chat, quick Q&A, or prompts < 8K tokens (use simpler settings for max throughput)
Usage
# Quick start (text-only)
llama-server -m qwen3.6-35b-a3b-nvfp4-mtp.gguf --port 8081
# With multimodal projector (vision-language)
llama-server -m qwen3.6-35b-a3b-nvfp4-mtp.gguf -mm mmproj-BF16.gguf --port 8081
Differences from Qwen3.6-27B-Text
| Aspect | 27B Text | 35B A3B |
|---|---|---|
| Parameters | 27B | 35B + A3B MoE |
| Quantization | NVFP4 | NVFP4 |
| Multimodal | No (text-only) | Yes (mmproj BF16) |
| Context window | 128K | 256K |
| Reasoning budget | 8192 | 8192 |
| Speculative draft max | 2 | 2 |
| Parallel sequences | 1 | 1 |
| Model size | ~14 GB | ~22 GB + 863 MB mmproj |
| Host port | 8000 | 8001 |
License
Apache 2.0
- Downloads last month
- 1,009
4-bit
Model tree for nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF
Base model
Qwen/Qwen3.6-35B-A3B
Install from pip and serve model
# Install SGLang from pip: pip install sglang# Start the SGLang server: python3 -m sglang.launch_server \ --model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \ --host 0.0.0.0 \ --port 30000# Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'