Nemotron-3.5 ASR Streaming 0.6B — MLX 4bit
INT4-quantized (mlx.nn.quantize(group_size=64, bits=4)) port of NVIDIA's Nemotron-3.5 streaming ASR for Apple Silicon. It has the smallest on-disk size and the lowest streaming RSS of any variant, at a cost of ~6 pp average WER vs fp32; Arabic (ar_eg) and English (en_us) are most affected.
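To make the `group_size=64, bits=4` setting concrete, here is a minimal pure-Python sketch of group-wise affine quantization in general. It is an illustration only: the helper names are hypothetical, and MLX's actual rounding, bit-packing, and storage layout may differ.

```python
def quantize_group(weights, bits=4):
    """Map one group of floats to unsigned `bits`-bit integers plus (scale, bias)."""
    levels = (1 << bits) - 1                      # 15 distinct steps for 4 bits
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0       # fall back to 1.0 for constant groups
    q = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return q, scale, w_min                        # w is approximated by q * scale + bias


def dequantize_group(q, scale, bias):
    """Reconstruct approximate floats from the quantized integers."""
    return [v * scale + bias for v in q]


def quantize(weights, group_size=64, bits=4):
    """Quantize a flat weight list group by group (one scale and bias per group)."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]
```

Each group of 64 weights shares one scale and one bias, so the reconstruction error per weight is bounded by half a quantization step of that group. Smaller groups track local weight ranges more closely but store more scale and bias values.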
Model
| Property | Value |
|---|---|
| Parameters | 600 M (4 bit weights, group=64) |
| Architecture | FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel |
| Languages | 40 |
| Sample rate | 16 kHz mono |
| Streaming chunk | 320 ms |
| On-disk size | 473 MB |
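The sample rate and streaming chunk values above imply a fixed number of samples per chunk. The sketch below works that out and frames a mono signal into chunks. The helper names are hypothetical, and zero-padding the final chunk is an assumption; the card does not describe how the model handles a trailing partial chunk or its cache context.

```python
SAMPLE_RATE_HZ = 16_000   # from the Model table
CHUNK_MS = 320            # from the Model table


def chunk_samples(sample_rate_hz=SAMPLE_RATE_HZ, chunk_ms=CHUNK_MS):
    """Number of audio samples in one streaming chunk."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples, chunk_len):
    """Yield fixed-length chunks, zero-padding the last one (an assumption)."""
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        chunk += [0.0] * (chunk_len - len(chunk))
        yield chunk
```

With these values, `chunk_samples()` gives 5120 samples per chunk, so one second of 16 kHz audio spans four chunks, the last of which is mostly padding.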
Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 473 MB | 4 bit grouped-quantized weights |
| vocab.json | 100 KB | id → SentencePiece piece |
| lang2slot.json | 2 KB | Language tag → prompt slot |
| config.json | <1 KB | Quantization config + arch/streaming geometry |
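As an illustration of how an id → piece mapping like vocab.json is typically used, the sketch below turns decoded token ids back into text. It assumes the file is a JSON object keyed by stringified ids and that pieces follow the usual SentencePiece convention of marking word starts with "▁" (U+2581). Neither detail is confirmed by this card, and the function name is hypothetical.

```python
import json


def pieces_to_text(ids, id_to_piece):
    """Join SentencePiece pieces and turn the word-start marker into spaces."""
    pieces = [id_to_piece[str(i)] for i in ids]   # JSON object keys are strings
    return "".join(pieces).replace("\u2581", " ").strip()


# Toy vocabulary standing in for vocab.json (hypothetical contents).
toy_vocab = json.loads('{"0": "\\u2581hello", "1": "\\u2581wor", "2": "ld"}')
text = pieces_to_text([0, 1, 2], toy_vocab)
```

If the real file turns out to be a JSON list rather than an object, indexing by integer position instead of `str(i)` would be the corresponding change.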
Performance
Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test set.
Accuracy
| lang | WER % | CER % | Δ WER vs fp32 source |
|---|---|---|---|
| en_us | 15.98 | 8.69 | +6.65 |
| de_de | 14.96 | 6.43 | +4.74 |
| fr_fr | 15.85 | 5.98 | +4.72 |
| ar_eg | 20.88 | 5.61 | +7.61 |
| hi_in | 8.13 | 6.52 | +2.87 |
| ja_jp | 19.56 | 12.89 | +2.59 |
Recommendation: ship 4bit only when the accuracy hit is acceptable (on-device assistants, transcription-as-hint). Use 8bit or CoreML INT8 for production transcription.
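For readers who want to reproduce figures like those in the accuracy table, the sketch below implements the standard definitions of WER and CER as edit distance over words and characters. The card does not state which scoring tool or text normalization was used, so treat this as the textbook metric rather than the exact evaluation pipeline.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A Δ WER column such as the one above would then be the difference between the quantized model's WER and the fp32 model's WER on the same utterances.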
Streaming throughput + memory
| metric | value |
|---|---|
| RTF (encode + decode) | 0.041 |
| p50 chunk latency | 12.8 ms |
| p99 chunk latency | 15.6 ms |
| RSS post-load | 270 MB (mmap) |
| RSS peak (mid-stream) | 747 MB |
This is the lowest absolute streaming peak of all measured variants, about 40 % below CoreML INT8 (1238 MB).
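The throughput and memory figures can be sanity-checked with a little arithmetic. The sketch below assumes the real-time factor is processing time divided by audio duration, which is the usual definition; the card does not spell out how its RTF was measured. Function names are illustrative.

```python
CHUNK_MS = 320.0  # streaming chunk length from the Model table


def real_time_factor(processing_ms, audio_ms=CHUNK_MS):
    """Processing time per unit of audio; values below 1 are faster than real time."""
    return processing_ms / audio_ms


def relative_saving(value, baseline):
    """Fraction by which `value` is lower than `baseline`."""
    return 1.0 - value / baseline


rtf_p50 = real_time_factor(12.8)            # p50 chunk latency over one 320 ms chunk
rtf_p99 = real_time_factor(15.6)            # p99 chunk latency over one 320 ms chunk
rss_saving = relative_saving(747, 1238)     # peak RSS vs CoreML INT8
```

The p50-based estimate comes out at about 0.04, close to the reported RTF of 0.041, and the peak RSS saving comes out at about 0.397, in line with the 40 % figure.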
Usage
Python / MLX
from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-4bit")
# Load and assemble the model the same way as for the bf16 sibling bundle.
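As a starting point for the loading step, the sketch below reads the three JSON files listed in the Files table from a downloaded bundle directory. The function name is hypothetical, and the internal schema of each file is not documented in this card, so the code only parses them and leaves interpretation to the caller.

```python
import json
from pathlib import Path

METADATA_FILES = ("config.json", "vocab.json", "lang2slot.json")


def load_bundle_metadata(bundle_dir):
    """Parse the bundle's JSON metadata files into a dict keyed by file stem."""
    root = Path(bundle_dir)
    metadata = {}
    for name in METADATA_FILES:
        with open(root / name, encoding="utf-8") as f:
            metadata[Path(name).stem] = json.load(f)
    return metadata
```

With the `bundle` path returned by `snapshot_download`, `load_bundle_metadata(bundle)["lang2slot"]` would give the language tag to prompt slot mapping. The weights in model.safetensors still have to be loaded separately and wired into a model definition, which this card does not include.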
Swift (speech-swift)
The speech-swift SDK ships the CoreML INT8 variant. If you specifically need the lowest peak RSS (~750 MB) and can accept ~6 pp higher WER, you'd need to wire this MLX 4bit bundle via mlx-swift; otherwise the CoreML INT8 wrapper is recommended.
CLI
brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US
Source
Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.