I am working on a school project that requires me to manually quantize each layer of a model. Specifically, I want to implement the following pipeline by hand:
quantized activation + quantized weight A → layer A → quantized output → dequantized output → requantized output + quantized weight B → layer B → ...
I know PyTorch already has a quantization API, but it is limited to int8. I would like to quantize at bit widths from 16 down to 2 and then compare the resulting accuracies.
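For context, here is a minimal sketch of the kind of per-tensor helpers I mean (the names and signatures are my own placeholders, not PyTorch APIs):

```python
import torch

def quantize(x, x_min, x_max, bit):
    # Affine quantization: map [x_min, x_max] onto the signed
    # integer range [-2^(bit-1), 2^(bit-1) - 1].
    qmin, qmax = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = qmin - round(x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.long(), scale, zero_point

def dequantize(q, scale, zero_point):
    # Inverse mapping: recover the approximate real values.
    return scale * (q - zero_point)
```

With these, one layer of the pipeline would be quantize → integer matmul → dequantize → requantize for the next layer.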
The issue I ran into is that after quantization, the output of a layer is orders of magnitude larger than expected (even with bit = 16), and I don't know how to dequantize it back. I am quantizing with a single min and max shared by the activation and the weight. Here is an example:
Activation = [1, 2, 3, 4]
Weight = [5, 6, 7, 8]
Min and max across activation and weight = 1, 8
Expected, non-quantized output = 1*5 + 2*6 + 3*7 + 4*8 = 70
Quantizing with bit = 16:
Quantized activation = [-32768, -23406, -14044, -4681]
Quantized weight = [4681, 14043, 23405, 32767]
Quantized output (dot product) = -964159613
Dequantized output with min = 1, max = 8 = -102980
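For reference, here is a small script that reproduces these numbers (assuming the affine mapping above, with scale = 7/65535 and zero point = -42130 derived from min = 1, max = 8):

```python
import torch

bit = 16
qmin, qmax = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1  # -32768, 32767
scale = (8.0 - 1.0) / (qmax - qmin)                 # 7 / 65535
zero_point = qmin - round(1.0 / scale)              # -42130

a = torch.tensor([1.0, 2.0, 3.0, 4.0])
w = torch.tensor([5.0, 6.0, 7.0, 8.0])

qa = (torch.round(a / scale) + zero_point).long()   # [-32768, -23406, -14044, -4681]
qw = (torch.round(w / scale) + zero_point).long()   # [4681, 14043, 23405, 32767]

q_out = int((qa * qw).sum())                        # -964159613
print(scale * (q_out - zero_point))                 # ≈ -102980, far from the expected 70
```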
The calculation makes sense to me: since the output is a product of activations and weights, their magnitude increases multiply together as well. So if I dequantize only once with the original min and max, it is expected that the output comes out much larger.
How does PyTorch handle dequantization? I tried to locate PyTorch's quantization source but could not find it. How do I dequantize the output?
I think there may be an issue with your formula for calculating the dequantized output.
Let's recalculate the output, tracking the scale and zero point explicitly:
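With affine quantization, q = round(x / s) + z, so the real value is recovered as x ≈ s * (q - z). A product of two quantized values therefore dequantizes as s_a * s_w * (q_a - z_a) * (q_w - z_w): you have to subtract the zero point from each operand before multiplying, and rescale the accumulator by the product of the two scales, not apply the single (min, max) mapping once. A sketch using the numbers from the question (s = 7/65535 and z = -42130 are what min = 1, max = 8 at 16 bits imply):

```python
import torch

scale = 7.0 / 65535.0    # shared by activation and weight in this example
zero_point = -42130

qa = torch.tensor([-32768, -23406, -14044, -4681], dtype=torch.int64)
qw = torch.tensor([4681, 14043, 23405, 32767], dtype=torch.int64)

# Remove the zero point from each operand, accumulate in integers,
# then rescale once by the product of the two scales.
acc = ((qa - zero_point) * (qw - zero_point)).sum()
out = (scale * scale) * int(acc)
print(out)   # ≈ 70.0, up to rounding error from quantization
```

As far as I can tell, this is also what PyTorch's quantized kernels do: quantized linear/conv layers accumulate in int32 with the zero-point terms handled as above, the accumulator is requantized for the next layer with the combined multiplier s_x * s_w / s_y, and Tensor.dequantize() simply computes s * (q - z). So in your pipeline, the "dequantize the output" step needs the product of the activation and weight scales, not the original single scale.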