Quantization

A technique that reduces the numerical precision of a model's weights and activations, typically from FP32 to lower-precision formats such as FP16, INT8, or INT4.

Benefits:

  • Smaller model size

  • Faster inference speed

  • Lower power consumption

Problems:

  • Accuracy loss

  • Additional complexity
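
To make the FP32 → INT8 idea above concrete, here is a small self-contained sketch of affine (scale + zero-point) quantization in NumPy; the array values and per-tensor scheme are purely illustrative, and real frameworks usually apply this per layer or per channel.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of an FP32 array to INT8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = max((x_max - x_min) / 255.0, 1e-8)   # 256 representable INT8 values
    zero_point = round(-x_min / scale) - 128      # maps x_min to -128, x_max to ~127
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an FP32 approximation of the original values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # toy FP32 weights
q, scale, zp = quantize_int8(weights)
print("max abs error:", np.abs(weights - dequantize(q, scale, zp)).max())
```

The rounding error printed at the end is the source of the accuracy loss listed above: the model keeps working with approximations of its original weights.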

1. Post-training quantization

Applied after the model has already been trained in full precision (FP32); no retraining is required.

1.1. Dynamic quantization (or runtime quantization)

  • Only the weights are quantized ahead of time.

  • Activations are stored and passed around in FP32.

  • At runtime, activations are quantized on the fly, using the range observed for each input batch.

Pros:

  • No calibration data is needed

Cons:

  • May cause a slight loss in accuracy
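
A minimal sketch of dynamic quantization using PyTorch's quantize_dynamic API; the toy model and layer sizes are placeholders. Only the weights of the listed module types (here nn.Linear) are converted to INT8, while activations are quantized per batch at inference time.

```python
import torch
import torch.nn as nn

# Toy FP32 model; the sizes are arbitrary for the example.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Quantize only the weights of nn.Linear modules to INT8.
# Activations stay in FP32 and are quantized on the fly per batch.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface as the original model, smaller weights
```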

1.2. Static quantization

  • Both weights and activations are quantized

  • Uses a calibration step: a small training/validation sample is passed through the model to observe the range of each layer's activations. INT8 is normally the target precision

Pros:

  • Faster inference, since both weights and activations run in INT8

  • Good accuracy, provided the calibration data is representative of real inputs

Cons:

  • More complex process
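
A sketch of post-training static quantization with PyTorch's eager-mode workflow; the toy model and random calibration batches are placeholders, and in practice the calibration loop would feed a small sample of representative real data.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 boundary
        self.fc = nn.Linear(64, 10)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc(x))
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)      # inserts observers

# Calibration: run a small sample through the model so the observers
# can record the activation ranges of each layer.
for _ in range(32):
    prepared(torch.randn(8, 64))

quantized = torch.quantization.convert(prepared)  # INT8 weights and activations
```

In real models, operator fusion (e.g. Conv + BatchNorm + ReLU via torch.quantization.fuse_modules) is typically applied before prepare to improve both accuracy and speed.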

2. Quantization-Aware Training

  • Quantization of weights and activations is simulated during training, so the model learns to adapt to the lower precision.

  • "Fake" quantization nodes round values in the forward pass; in backpropagation they are treated as pass-through (straight-through estimator), so gradients keep flowing in full precision.

Pros:

  • High accuracy

  • Robust to more aggressive precision reductions

Cons:

  • Complexity

  • Longer training time.
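
A sketch of quantization-aware training with PyTorch's eager-mode QAT API; the model, random data, and number of steps are placeholders. prepare_qat inserts the fake-quantization nodes, the usual training loop runs underneath them in FP32, and convert produces the final INT8 model.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fake-quantize inputs
        self.fc = nn.Linear(32, 2)
        self.dequant = torch.quantization.DeQuantStub()  # back to FP32 outputs

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model = torch.quantization.prepare_qat(model)            # insert fake-quant nodes

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):                                     # placeholder training loop
    x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()                      # gradients flow past fake-quant
    optimizer.step()

model.eval()
quantized = torch.quantization.convert(model)            # real INT8 model
```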
