Nemotron-3.5 ASR Streaming 0.6B — MLX 4bit

INT4-quantized port (mlx.nn.quantize(group_size=64, bits=4)) of NVIDIA's Nemotron-3.5 streaming ASR model for Apple Silicon. It has the smallest on-disk size and the lowest streaming RSS of any variant, at a cost of roughly 6 pp average WER versus fp32; among the languages in the accuracy table, Arabic (ar_eg) and English (en_us) are most affected.
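To make the group_size=64, bits=4 setting concrete, here is a minimal pure-Python sketch of group-wise affine quantization in the style the MLX documentation describes: each group of 64 weights shares one scale and one bias, and each weight is stored as an integer code in 0..15. The function names are hypothetical, and this is not the MLX kernel itself (which packs codes into 32-bit words); it only illustrates the rounding scheme.

```python
import random

GROUP_SIZE = 64  # weights sharing one scale/bias pair
BITS = 4         # 16 quantization levels per weight


def quantize_group(weights, bits=BITS):
    """Map one group of floats to integer codes plus (scale, bias)."""
    levels = (1 << bits) - 1            # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w ~= scale * code + bias."""
    return [scale * q + bias for q in codes]


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)

# Worst-case rounding error is half a quantization step.
max_err = max(abs(a - b) for a, b in zip(group, restored))
```

If scales and biases are stored in 16 bits (an assumption about this bundle), the overhead is 32 bits per 64 weights, so the effective cost is about 4.5 bits per quantized weight.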

Model

| Property | Value |
|---|---|
| Parameters | 600 M (4-bit weights, group size 64) |
| Architecture | FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel |
| Languages | 40 |
| Sample rate | 16 kHz mono |
| Streaming chunk | 320 ms |
| On-disk size | 473 MB |
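The streaming geometry above fixes how much audio the model sees per step: 320 ms at 16 kHz is 5,120 samples. A minimal sketch of slicing a mono buffer into such chunks follows; iter_chunks is a hypothetical helper, and zero-padding the final partial chunk is an assumption, not something this model card specifies.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 320
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 5120 samples per chunk


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks, zero-padding the last one if it is short."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        if len(chunk) < chunk_samples:
            chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk


one_second = [0.0] * SAMPLE_RATE_HZ      # 1 s of silence as a stand-in
chunks = list(iter_chunks(one_second))   # three full chunks plus one padded
```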

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 473 MB | 4-bit grouped-quantized weights |
| vocab.json | 100 KB | id → SentencePiece piece |
| lang2slot.json | 2 KB | Language tag → prompt slot |
| config.json | <1 KB | Quantization config + arch/streaming geometry |
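As a rough illustration of how the two lookup files might be used, the sketch below maps a language tag to its prompt slot and turns token ids back into text. The JSON shapes (string ids as keys in vocab.json, tags such as en-US as keys in lang2slot.json), the sample entries, and both helper names are assumptions for illustration; inspect the actual files before relying on them.

```python
# Tiny stand-ins for the bundle files; the real contents will differ.
# In practice these would come from json.load() on vocab.json / lang2slot.json.
vocab = {"0": "<blank>", "1": "\u2581hello", "2": "\u2581wor", "3": "ld"}
lang2slot = {"en-US": 0, "de-DE": 1}


def prompt_slot(language, table):
    """Look up the prompt slot for a language tag, failing loudly if absent."""
    if language not in table:
        raise KeyError(f"unsupported language tag: {language}")
    return table[language]


def pieces_to_text(token_ids, id_to_piece):
    """Join SentencePiece pieces; U+2581 marks the start of a word."""
    pieces = [id_to_piece[str(i)] for i in token_ids]
    return "".join(pieces).replace("\u2581", " ").strip()


slot = prompt_slot("en-US", lang2slot)
text = pieces_to_text([1, 2, 3], vocab)
```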

Performance

Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test set.

Accuracy

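| Language | WER (%) | CER (%) | Δ WER vs fp32 source (pp) |
|---|---|---|---|
| en_us | 15.98 | 8.69 | +6.65 |
| de_de | 14.96 | 6.43 | +4.74 |
| fr_fr | 15.85 | 5.98 | +4.72 |
| ar_eg | 20.88 | 5.61 | +7.61 |
| hi_in | 8.13 | 6.52 | +2.87 |
| ja_jp | 19.56 | 12.89 | +2.59 |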
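For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words, and CER is the same at character level. The sketch below is a generic implementation, not the evaluation script behind the table; any text normalization applied to the FLEURS transcripts before scoring is not stated here and is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,             # deletion
                      row[j - 1] + 1,         # insertion
                      prev_diag + (r != h))   # substitution or match
            prev_diag, row[j] = row[j], cur
    return row[-1]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```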

Recommendation: ship 4bit only when the accuracy hit is acceptable (on-device assistants, transcription-as-hint). Use 8bit or CoreML INT8 for production transcription.

Streaming throughput + memory

| Metric | Value |
|---|---|
| RTF (encode + decode) | 0.041 |
| p50 chunk latency | 12.8 ms |
| p99 chunk latency | 15.6 ms |
| RSS post-load | 270 MB (mmap) |
| RSS peak (mid-stream) | 747 MB |

This is the lowest absolute streaming peak RSS of all measured variants, about 40 % below CoreML INT8 (1238 MB).
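These figures can be cross-checked with simple arithmetic. Per-chunk RTF is processing time divided by audio duration, so a 12.8 ms median on a 320 ms chunk gives about 0.04, in line with the reported 0.041; the memory saving follows from the two peak RSS values. The helper names are illustrative.

```python
CHUNK_MS = 320.0  # streaming chunk duration from the model table


def real_time_factor(latency_ms, audio_ms=CHUNK_MS):
    """Processing time divided by the duration of audio processed."""
    return latency_ms / audio_ms


def relative_saving(value, baseline):
    """Fraction by which `value` is below `baseline`."""
    return 1.0 - value / baseline


rtf_p50 = real_time_factor(12.8)      # about 0.040
rtf_p99 = real_time_factor(15.6)      # about 0.049
saving = relative_saving(747, 1238)   # about 0.397, i.e. roughly 40 %
```

Even at p99 latency a chunk finishes in under 5 % of its audio duration, which leaves ample headroom for real-time streaming on the measured hardware.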

Usage

Python / MLX

from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-4bit")
# Load and assemble the model the same way as the bf16 sibling bundle.
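Once the snapshot is on disk, a quick sanity check is to list the bundle files and read config.json before wiring up the model. This sketch uses only the standard library; describe_bundle is a hypothetical helper, and no particular keys in config.json are assumed.

```python
import json
from pathlib import Path


def describe_bundle(bundle_dir):
    """Return ({filename: size in MB}, parsed config.json or None)."""
    root = Path(bundle_dir)
    sizes = {p.name: p.stat().st_size / 1e6
             for p in sorted(root.iterdir()) if p.is_file()}
    config_path = root / "config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else None
    return sizes, config
```

Calling describe_bundle(bundle) on the snapshot directory should show model.safetensors at roughly 473 MB alongside the three JSON files listed under Files.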

Swift (speech-swift)

The speech-swift SDK ships the CoreML INT8 variant. If you specifically need the lowest peak RSS (~750 MB) and can accept ~6 pp higher WER, you'd need to wire this MLX 4bit bundle via mlx-swift; otherwise the CoreML INT8 wrapper is recommended.

CLI

brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US

Source

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
