Nemotron-3.5 ASR Streaming 0.6B — MLX 4bit

INT4-quantized (mlx.nn.quantize(group_size=64, bits=4)) port of NVIDIA's Nemotron-3.5 streaming ASR for Apple Silicon. It has the smallest on-disk size and the lowest streaming RSS of any variant, at a cost of roughly 6 pp average WER versus fp32. Of the languages benchmarked below, Arabic (ar_eg) and English (en_us) are most affected.
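
To make the group_size=64, bits=4 setting concrete, here is a minimal pure-Python sketch of group-wise affine quantization. It assumes one scale and one offset per group of 64 weights, each stored in 16 bits; the helper names are hypothetical, and MLX's actual kernels and bit-packing layout may differ in detail.

```python
import random

GROUP_SIZE = 64   # weights that share one scale/offset pair
BITS = 4          # 2**4 = 16 representable levels per weight


def quantize_group(weights):
    """Affine-quantize one group to integer codes in [0, 2**BITS - 1]."""
    levels = 2 ** BITS - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float weights."""
    return [c * scale + offset for c in codes]


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]
codes, scale, offset = quantize_group(group)
restored = dequantize_group(codes, scale, offset)
max_err = max(abs(a - b) for a, b in zip(group, restored))

# Storage cost, assuming a 16-bit scale and a 16-bit offset per group.
bits_per_weight = BITS + (16 + 16) / GROUP_SIZE

print(f"max abs error {max_err:.5f} vs half a step {scale / 2:.5f}")
print(f"effective bits per quantized weight: {bits_per_weight}")
```

The rounding error per weight is bounded by half a quantization step, which is why smaller groups (tighter min/max ranges) trade extra scale/offset storage for accuracy.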

Model


Parameters	600 M (4 bit weights, group=64)
Architecture	FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel
Languages	40
Sample rate	16 kHz mono
Streaming chunk	320 ms
On-disk size	473 MB
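
The sample rate and chunk length above fix how many samples each streaming step consumes. The sketch below works that out and splits a buffer into chunks; `iter_chunks` is a hypothetical helper, and zero-padding the final chunk is an assumption rather than something this bundle documents.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 320
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # samples per streaming chunk


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks; the last one is zero-padded to full length."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk


one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence, mono
chunks = list(iter_chunks(one_second))
print(CHUNK_SAMPLES, len(chunks))  # 5120 4
```

One second of audio therefore spans three full chunks plus a padded fourth.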

Files

File	Size	Description
`model.safetensors`	473 MB	4 bit grouped-quantized weights
`vocab.json`	100 KB	id → SentencePiece piece
`lang2slot.json`	2 KB	Language tag → prompt slot
`config.json`	<1 KB	Quantization config + arch/streaming geometry
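
As a rough illustration of how the two lookup files might be used, the sketch below turns token ids into text and resolves a language tag to a prompt slot. The JSON contents, the `<blk>` blank-token name, and the tag spelling are all hypothetical stand-ins; check the real files for their actual keys. The only firm convention relied on is SentencePiece's "▁" (U+2581) word-boundary marker.

```python
import json

# Hypothetical file contents; the real bundle's entries will differ.
vocab_json = '{"0": "<blk>", "1": "\\u2581hello", "2": "\\u2581wor", "3": "ld"}'
lang2slot_json = '{"en-US": 0, "de-DE": 1}'

# Assumes vocab.json maps string ids to pieces; JSON keys are always strings.
id_to_piece = {int(k): v for k, v in json.loads(vocab_json).items()}
lang_to_slot = json.loads(lang2slot_json)


def pieces_to_text(token_ids, id_to_piece, skip=("<blk>",)):
    """Join SentencePiece pieces and turn the U+2581 marker into spaces."""
    pieces = [id_to_piece[i] for i in token_ids]
    text = "".join(p for p in pieces if p not in skip)
    return text.replace("\u2581", " ").strip()


decoded = pieces_to_text([1, 0, 2, 3], id_to_piece)
slot = lang_to_slot["en-US"]
print(decoded)  # hello world
print(slot)
```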

Performance

Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test set.

Accuracy

lang	WER %	CER %	Δ WER vs fp32 source (pp)
en_us	15.98	8.69	+6.65
de_de	14.96	6.43	+4.74
fr_fr	15.85	5.98	+4.72
ar_eg	20.88	5.61	+7.61
hi_in	8.13	6.52	+2.87
ja_jp	19.56	12.89	+2.59
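
For readers who want to reproduce numbers like these, the sketch below computes WER and CER as edit distance over words and characters. It scores a single utterance and applies no text normalization; the normalization and aggregation used for the table above are not stated, so treat this as an illustration of the metric rather than the exact evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for one utterance."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent for one utterance."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


example_wer = wer("the cat sat on the mat", "the cat sat on mat")
print(f"{example_wer:.2f}")  # 16.67 (one deleted word out of six)

# The fp32 baseline implied by a table row is WER minus the delta column.
en_us_fp32_wer = round(15.98 - 6.65, 2)
print(en_us_fp32_wer)  # 9.33
```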

Recommendation: ship 4bit only when the accuracy hit is acceptable (on-device assistants, transcription-as-hint). Use 8bit or CoreML INT8 for production transcription.

Streaming throughput + memory

metric	value
RTF (encode + decode)	0.041
p50 chunk latency	12.8 ms
p99 chunk latency	15.6 ms
RSS post-load	270 MB (mmap)
RSS peak (mid-stream)	747 MB

This is the lowest absolute streaming peak of all measured variants, about 40 % below CoreML INT8 (1238 MB).
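
The throughput and memory figures can be sanity-checked with a little arithmetic. The sketch below assumes RTF is total processing time divided by total audio duration and that each chunk covers 320 ms; how the reported RTF was actually measured is not stated, and the function names are illustrative.

```python
CHUNK_MS = 320.0


def real_time_factor(chunk_latencies_ms, chunk_ms=CHUNK_MS):
    """Processing time over audio duration; below 1.0 is faster than real time."""
    return sum(chunk_latencies_ms) / (chunk_ms * len(chunk_latencies_ms))


def percent_below(value, baseline):
    """How far `value` sits below `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# If every chunk took the reported p50 latency of 12.8 ms:
rtf_at_p50 = round(real_time_factor([12.8] * 100), 3)
print(rtf_at_p50)  # 0.04

# Peak RSS of this variant (747 MB) relative to CoreML INT8 (1238 MB):
rss_saving = round(percent_below(747, 1238), 1)
print(rss_saving)  # 39.7
```

An RTF of about 0.04 at the median chunk latency is in line with the reported 0.041.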

Usage

Python / MLX

from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-4bit")
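# `bundle` is the local snapshot directory containing the files listed above.
# Load the weights and assemble the model the same way as for the bf16 sibling.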

Swift (speech-swift)

The speech-swift SDK ships the CoreML INT8 variant. If you specifically need the lowest peak RSS (~750 MB) and can accept ~6 pp higher WER, you would need to integrate this MLX 4bit bundle yourself via mlx-swift; otherwise the CoreML INT8 wrapper is recommended.

CLI

brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US

Source

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
