FP16 test draft model conversion for google/gemma-4-12B-it

Surprisingly, "--spec-draft-n-max 1" looks like the winner on aggregate, but that could be due to a whole slew of factors, including a poor draft quantization on my part.

| Task | Baseline (tok/s) | Draft N=1 (tok/s) | Draft N=2 (tok/s) | Draft N=3 (tok/s) | Draft N=4 (tok/s) |
|---|---:|---:|---:|---:|---:|
| code_python | 49.3 | 51.5 | 46.5 | 45.6 | 50.1 |
| code_cpp | 49.2 | 60.3 | 55.8 | 54.6 | 50.9 |
| explain_concept | 49.2 | 57.1 | 54.4 | 49.9 | 46.8 |
| summarize | 48.7 | 60.8 | 61.3 | 62.8 | 55.9 |
| qa_factual | 48.6 | 59.5 | 55.9 | 56.8 | 55.5 |
| translation | 48.8 | 59.0 | 53.9 | 59.2 | 49.5 |
| creative_short | 48.0 | 52.3 | 46.0 | 46.1 | 41.4 |
| stepwise_math | 48.4 | 57.4 | 55.4 | 60.6 | 49.8 |
| long_code_review | 47.7 | 52.8 | 53.9 | 52.0 | 44.4 |
| Aggregate Avg. | ~48.6 | ~56.5 | ~53.7 | ~54.3 | ~49.5 |
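
To make the per-task picture easier to read, here is a minimal Python sketch that turns the throughput numbers above into speedups over baseline and picks the best draft length per task. The names `RESULTS`, `speedups`, `best_n`, and `mean_speedup` are made up for this example and are not part of any benchmark tooling. The mean is a plain unweighted average of per-task ratios, which is an assumption and may not match how the aggregate row was computed.

```python
# Throughput in tok/s, copied from the table above.
# Tuple layout: (baseline, draft N=1, draft N=2, draft N=3, draft N=4).
RESULTS = {
    "code_python":      (49.3, 51.5, 46.5, 45.6, 50.1),
    "code_cpp":         (49.2, 60.3, 55.8, 54.6, 50.9),
    "explain_concept":  (49.2, 57.1, 54.4, 49.9, 46.8),
    "summarize":        (48.7, 60.8, 61.3, 62.8, 55.9),
    "qa_factual":       (48.6, 59.5, 55.9, 56.8, 55.5),
    "translation":      (48.8, 59.0, 53.9, 59.2, 49.5),
    "creative_short":   (48.0, 52.3, 46.0, 46.1, 41.4),
    "stepwise_math":    (48.4, 57.4, 55.4, 60.6, 49.8),
    "long_code_review": (47.7, 52.8, 53.9, 52.0, 44.4),
}


def speedups(row):
    """Return draft throughput divided by baseline, for N=1..4."""
    baseline, *drafts = row
    return [d / baseline for d in drafts]


def best_n(row):
    """Return the draft length N (1-based) with the highest throughput."""
    drafts = row[1:]
    return max(range(len(drafts)), key=lambda i: drafts[i]) + 1


def mean_speedup(n):
    """Unweighted mean speedup over all tasks for draft length n (assumption)."""
    ratios = [speedups(row)[n - 1] for row in RESULTS.values()]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    for task, row in RESULTS.items():
        ratios = ", ".join(
            f"N={i + 1}: {r:.2f}x" for i, r in enumerate(speedups(row))
        )
        print(f"{task:<18} best N={best_n(row)}  ({ratios})")
    for n in range(1, 5):
        print(f"mean speedup N={n}: {mean_speedup(n):.2f}x")
```

Read this way, N=1 is the fastest setting on five of the nine tasks, N=3 on three (summarize, translation, stepwise_math), and N=2 on one (long_code_review), so the best draft length may well depend on the workload.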

Format: GGUF (16-bit)
Model size: 0.4B params
Architecture: gemma4-assistant
