Quantization
A technique that reduces the numerical precision of a model's weights and activations, typically from FP32 to lower-precision formats such as FP16, INT8, or INT4 (a worked example of the mapping follows the lists below).
Benefits:
Smaller model size
Faster inference
Lower power consumption
Problems:
Accuracy loss
Additional complexity
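To make the precision reduction concrete, here is a minimal sketch of an affine (asymmetric) FP32 → INT8 mapping. It uses NumPy, and the function names (`quantize_int8`, `dequantize`) are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of an FP32 array to unsigned INT8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = max((x_max - x_min) / 255.0, 1e-8)  # map the observed range onto 256 levels
    zero_point = int(round(-x_min / scale))     # integer code representing the real value 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an FP32 approximation; the rounding error is the accuracy loss."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4).astype(np.float32)
q, s, zp = quantize_int8(x)
print(x)
print(dequantize(q, s, zp))  # matches x up to the quantization step `s`
```

Each INT8 value only needs to store an 8-bit code plus a shared scale and zero point, which is where the size and bandwidth savings come from.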
1. Post-training quantization
Quantization is applied after the model has been trained in full precision (FP32).
1.1. Dynamic quantization (or runtime quantization)
Only weights are quantized.
Activations are kept in FP32.
At runtime, activations are quantized on the fly, using the range observed for each input batch (see the sketch after this list).
Pros:
No calibration data is needed
Cons:
May cause a slight loss of accuracy
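A minimal sketch of dynamic quantization, assuming PyTorch's eager-mode API (`torch.quantization.quantize_dynamic`); the toy model is illustrative:

```python
import torch
import torch.nn as nn

# Small FP32 model (illustrative architecture).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Dynamic quantization: weights of the listed module types are converted to INT8
# ahead of time; activations stay FP32 and are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface as the FP32 model, smaller weights
```

No calibration data appears anywhere in this flow, which is the main appeal of the dynamic approach.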
1.2. Static quantization
Both weights and activations are quantized
Uses a calibration step: a small sample of training/validation data is passed through the model to observe the range of each layer's activations. INT8 is normally used (see the sketch after this list).
Pros:
Faster inference, since both weights and activations use INT8 and their ranges are precomputed
Typically better accuracy than dynamic quantization, thanks to calibrated activation ranges
Cons:
More complex process
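A minimal sketch of static quantization with PyTorch's eager-mode API; the `TinyNet` model and the number of calibration batches are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Illustrative FP32 model wrapped with quant/dequant stubs."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 at the input
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
torch.quantization.prepare(model, inplace=True)   # insert observers

# Calibration: a few representative batches so observers record activation ranges.
for _ in range(10):
    model(torch.randn(8, 128))

torch.quantization.convert(model, inplace=True)   # weights + activations -> INT8
print(model(torch.randn(1, 128)).shape)
```

The extra complexity is visible here: the model must be instrumented with stubs, a backend config chosen, and representative calibration data supplied.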
2. Quantization-Aware Training
Quantization of weights and activations is simulated during training so that the model learns to adapt to the lower precision.
"Fake" quantization nodes round values in the forward pass; backpropagation ignores the rounding (straight-through estimator), so gradients flow in full precision (see the sketch after the lists below).
Pros:
High accuracy, since the model compensates for quantization error during training
More robust than post-training quantization, especially at very low precision
Cons:
Complexity
Longer training time.
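A minimal sketch of quantization-aware training, again with PyTorch's eager-mode API; the model, optimizer settings, and training loop are illustrative:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Same illustrative model as above, wrapped with quant/dequant stubs."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc2(self.relu(self.fc1(self.quant(x)))))

model = TinyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)   # insert fake-quant modules

# Short training loop: forward passes see simulated INT8 rounding, while
# gradients flow through in FP32 (straight-through estimator).
opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(100):
    x, y = torch.randn(8, 128), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

model.eval()
torch.quantization.convert(model, inplace=True)       # produce the real INT8 model
```

The extra cost relative to post-training quantization is clear: a full training (or fine-tuning) loop is required before the model can be converted.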