7. Quantization-Aware Training (QAT)

Quantization refers to techniques for performing computations and storing tensors at bit widths lower than floating-point precision. A quantized model executes some or all tensor operations with integers rather than floating-point values. horizon_plugin_pytorch supports INT8 quantization, which, compared with a typical FP32 model, reduces model size by 4x and cuts memory bandwidth requirements by 4x. Hardware support for INT8 computation is typically 2 to 4 times faster than FP32 computation. Quantization is primarily a technique for accelerating inference, and the quantized operators support forward computation only.
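As a quick, hedged illustration of the 4x storage reduction, the snippet below uses PyTorch's built-in per-tensor affine quantization (not the horizon_plugin_pytorch API) to quantize an FP32 tensor to INT8 and compare per-element sizes:

```python
import torch

# FP32 tensor: 4 bytes per element.
x = torch.randn(1024, 1024)

# Per-tensor affine INT8 quantization: q = round(x / scale) + zero_point,
# stored as 1 byte per element (plus a scalar scale and zero point).
scale = x.abs().max().item() / 127
xq = torch.quantize_per_tensor(x, scale=scale, zero_point=0, dtype=torch.qint8)

print(x.element_size())                   # 4 bytes per FP32 element
print(xq.int_repr().element_size())       # 1 byte per INT8 element -> ~4x smaller
print((xq.dequantize() - x).abs().max())  # quantization error introduced
```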

horizon_plugin_pytorch provides BPU-adapted quantized operations and supports quantization-aware training (QAT). QAT inserts fake-quantization modules to model quantization errors during both the forward pass and backpropagation; note that all computation during QAT is still carried out in floating point. At the end of QAT, horizon_plugin_pytorch provides conversion functions that turn the trained model into a fixed-point model, which uses a more compact representation and high-performance vectorized operations on the BPU.
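For orientation, the sketch below shows the same prepare-train-convert flow using stock PyTorch eager-mode QAT (torch.ao.quantization). This is an illustrative assumption about the general workflow only: horizon_plugin_pytorch supplies its own BPU-adapted equivalents of these steps, whose exact API is described in the following sections.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)

# Toy model; a real network places Quant/DeQuant stubs around the quantized region.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyNet().train()
model.qconfig = get_default_qat_qconfig("fbgemm")

# Insert fake-quantization modules; forward and backward still run in FP32.
qat_model = prepare_qat(model)

# ... normal training loop on qat_model (one step shown) ...
optimizer = torch.optim.SGD(qat_model.parameters(), lr=1e-3)
out = qat_model(torch.randn(1, 3, 32, 32))
out.mean().backward()
optimizer.step()

# After training, fold the learned fake-quant parameters into a fixed-point model.
quantized_model = convert(qat_model.eval())
```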

This section gives a detailed introduction to the horizon_plugin_pytorch quantization training tool, which is developed on the basis of PyTorch.