How to use from
SGLang
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
Quick Links

Qwen3.6-35B-A3B-NVFP4-MTP (GGUF)

NVFP4-quantized, multimodal GGUF build of Qwen3.6-35B-A3B for llama.cpp on NVIDIA DGX Spark (GB10, SM121).

Role in the .init Stack

This is the fast, large-context, vision-capable model for the .init AI engineering platform. It's exposed as the .INIT/Flash alias in LiteLLM and is the go-to model for long-context tasks and multimodal reasoning on DGX Spark.

Property Value
.INIT/ alias .INIT/Flash
Docker profile llama-qwen-3-6-35b-a3b
Host port 8001
Use case Large-context analysis, vision tasks, fast inference
Context 256K tokens (entire repositories + images)
Vision βœ… mmproj BF16 GGUF for multimodal reasoning

Why This Model for DGX Spark

  • MoE architecture β€” 35B total, only 3.5B active params per token β†’ fast inference (~30 tok/s) despite large total size
  • 256K context β€” YaRN scaling from 128K native base, enough for full repository scans
  • Vision support β€” separate mmproj GGUF enables code screenshot analysis, UI understanding, diagram reasoning
  • ~22 GB + 863 MB mmproj β†’ ~80 GB in GPU memory, fits in 128 GB with room for OS and other services
  • NVFP4 quantization β€” works on SM121 without server-class TMEM hardware required by vLLM

Optimal Settings for the .init Stack

These are the settings used in docker-compose.interface.yml:

llama-server \
  -m /models/model.gguf \
  -mm /models/mmproj.gguf \
  -a qwen3.6_35b_a3b \
  --jinja --chat-template-file /workspace/chat_template.jinja \
  --reasoning on --reasoning-format deepseek --reasoning-budget 8192 \
  --min-p 0.05 \
  --spec-type draft-mtp,ngram-mod --spec-draft-n-max 2 \
  --spec-draft-p-min 0.88 --spec-draft-ngl 99 \
  -ctk f16 -ctv f16 -ngl all -fa on -sm none -fit off \
  -c 262144 --rope-scaling yarn --yarn-orig-ctx 131072 \
  -b 2048 -ub 512 \
  --parallel 1 --cont-batching --cache-prompt --swa-full \
  -t 8 -tb 8 --mlock \
  --port 8080 --host 0.0.0.0 --metrics --timeout 120

When to Use in .init

Task Use .INIT/Flash (35B A3B) Use .INIT/Pro (27B)
Quick code generation βœ… Fast Better quality
Full repo scan (50K-256K tokens) βœ… 256K context ❌ 128K limit
UI screenshot analysis βœ… Vision mmproj ❌ Text-only
Diagram / chart reasoning βœ… Vision mmproj ❌ Text-only
Deep debugging / architecture Good βœ… Better reasoning
Chain-of-thought tasks Good βœ… Deeper reasoning

Why This Exists

  • Inspired by sakamakismile/Qwen3.6-35B-A3B-NVFP4 β€” NVFP4 quantization + MTP restoration recipe
  • Goal: build a GGUF-compatible NVFP4 image for use with llama.cpp on DGX Spark, including the multimodal projector (mmproj) for vision-language tasks

What Changed

  • Quantized to NVFP4 via nvidia-modelopt (group_size=16)
  • MTP head preserved in BF16 for speculative decoding
  • Multimodal projector (mmproj) exported separately as BF16 GGUF for vision tasks
  • Calibrated on neuralmagic/calibration (20 samples)

Why llama.cpp + ModelOpt NVFP4 (and not vLLM)

The vLLM Problem on DGX Spark

Deploying NVFP4 via vLLM on DGX Spark (SM120/SM121) hits multiple blockers:

  1. Missing TMEM hardware β€” DGX Spark's edge-tier Blackwell chip lacks the 256 KB Tensor Memory (TMEM) found in datacenter SM100. NVFP4's block-packed layout cannot take the hardware-accelerated fast path, wasting VRAM and losing throughput vs. AWQ/FP8.
  2. Unoptimized kernels β€” vLLM's NVFP4 kernels for SM120 fall back to slow software paths, underperforming established 4-bit formats.
  3. MTP shape mismatch β€” Qwen3.6's speculative decoding head fails to inherit NVFP4's quantization layout in vLLM, causing batch initialization errors or 0% acceptance rates.
  4. Illegal instruction crashes β€” vLLM may invoke server-class Cutlass/FlashInfer backends incompatible with Spark's edge silicon.
  5. Driver lock β€” vLLM's latest NVFP4 support requires driver 595.58+, but DGX Spark ships with 580.x. Forcing an upgrade can break the unified memory fabric.

Why This Stack Works

Component Why
llama.cpp No TMEM dependency β€” GGUF loads weights directly into GPU memory without layout transformations that require server-class hardware
ModelOpt NVFP4 NVIDIA's own quantizer produces compact weights (~24 GB for 35B A3B) with native BF16 MTP preservation
MTP + n-gram Dual speculative decoding path achieves high throughput on DGX Spark without vLLM's MTP bugs
Multimodal Separate BF16 mmproj GGUF enables vision-language tasks without re-quantizing the projector

Contents

File Size Description
qwen3.6-35b-a3b-nvfp4-mtp.gguf ~22 GB Main model weights (NVFP4 quantized)
mmproj-BF16.gguf ~863 MB Multimodal projector (BF16, for vision tasks)

Performance

Condition Throughput Notes
DGX Spark, short prompts ~30 tok/s MTP n=2 + ngram speculative decoding, model fully on GPU
DGX Spark, long context (256K) ~18–28 tok/s YaRN RoPE scaling, KV cache grows with context

30 tok/s achieved when:

  • Single DGX Spark node (GB10, SM121, 128 GB unified memory)
  • Speculative decoding enabled (--spec-type draft-mtp,ngram-mod, --spec-draft-n-max 2)
  • Model fully resident on GPU (-ngl all, --mlock)
  • Short-to-medium context (< 8K tokens in prompt)
  • 8 threads, flash attention on, F16 KV cache

Optimal Settings for Long Context

llama-server \
  -m qwen3.6-35b-a3b-nvfp4-mtp.gguf \
  -mm mmproj-BF16.gguf \
  -a qwen3.6_35b_a3b \
  --jinja \
  --chat-template-file /workspace/chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --reasoning-budget 8192 \
  --min-p 0.05 \
  --spec-type draft-mtp,ngram-mod \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.88 \
  --spec-draft-ngl 99 \
  -ctk f16 -ctv f16 \
  -ngl all \
  -fa on \
  -sm none \
  -fit off \
  -c 262144 \
  --rope-scaling yarn \
  --yarn-orig-ctx 131072 \
  -b 2048 \
  -ub 512 \
  --parallel 1 \
  --cont-batching \
  --cache-prompt \
  --swa-full \
  -t 8 -tb 8 \
  --mlock \
  --port 8080 \
  --host 0.0.0.0 \
  --metrics \
  --timeout 120

Key Settings Explained

Flag Value Why
-m qwen3.6-35b-a3b-nvfp4-mtp.gguf Main NVFP4-quantized model weights
-mm mmproj-BF16.gguf Multimodal projector for vision-language tasks
-c 262144 256K context window for long documents
--rope-scaling yarn + --yarn-orig-ctx 131072 YaRN extrapolation from 128K base Preserves quality beyond native context
--spec-type draft-mtp,ngram-mod MTP + n-gram hybrid Dual speculative path for higher acceptance rate
--spec-draft-n-max 2 2 draft tokens Conservative speculation for larger model
--spec-draft-p-min 0.88 88% acceptance threshold Balanced speculation, fallback to n-gram
--reasoning-budget 8192 8192 tokens Extended reasoning budget for complex tasks
-ngl all All layers on GPU No CPU offloading β€” DGX Spark has 128 GB
-fa on Flash attention O(n) memory for long context
-ctk f16 / -ctv f16 F16 KV cache Precision-critical for 256K context
-b 2048 / -ub 512 Prefill 2048, decode 512 Balanced batch sizing for throughput
--parallel 1 1 concurrent sequence Larger model β€” single sequence avoids memory pressure
--cont-batching Continuous batching Better GPU utilization under load
--swa-full Full sliding window attention Better long-range attention quality
--mlock Lock in RAM Prevents eviction during long generations

When to Use Long Context Settings

  • Codebase analysis: scanning entire repositories (50K–256K tokens)
  • Document reasoning: legal/technical documents with cross-reference needs
  • Extended conversations: multi-turn sessions accumulating context
  • Multimodal reasoning: vision-language tasks with long textual context
  • Not needed for: chat, quick Q&A, or prompts < 8K tokens (use simpler settings for max throughput)

Usage

# Quick start (text-only)
llama-server -m qwen3.6-35b-a3b-nvfp4-mtp.gguf --port 8081

# With multimodal projector (vision-language)
llama-server -m qwen3.6-35b-a3b-nvfp4-mtp.gguf -mm mmproj-BF16.gguf --port 8081

Differences from Qwen3.6-27B-Text

Aspect 27B Text 35B A3B
Parameters 27B 35B + A3B MoE
Quantization NVFP4 NVFP4
Multimodal No (text-only) Yes (mmproj BF16)
Context window 128K 256K
Reasoning budget 8192 8192
Speculative draft max 2 2
Parallel sequences 1 1
Model size ~14 GB ~22 GB + 863 MB mmproj
Host port 8000 8001

License

Apache 2.0

Downloads last month
1,009
GGUF
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for nilayparikh/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF

Quantized
(435)
this model