I was reading about quantization (specifically about int8) and trying to figure out whether there is a way to avoid dequantizing and requantizing the output of a node before feeding it to the next one. Eventually I found the definitions of static and dynamic quantization. According to onnxruntime:

Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically. [...] Static quantization method first runs the model using a set of inputs called calibration data. During these runs, we compute the quantization parameters for each activations. These quantization parameters are written as constants to the quantized model and used for all inputs.

To me that seems quite clear: the difference between the two methods is about when the (de)quantization parameters are computed (dynamic computes them at inference time, static computes them before inference and hardcodes them into the model), and not about the actual (de)quantization process.
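To make the distinction concrete, this is roughly how I picture the two entry points (just a sketch, the file names are placeholders and I haven't run this exact snippet):

from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType

# Dynamic: no calibration data; activation scale/zero-point are computed at inference time.
quantize_dynamic("model.onnx", "model_dynamic_int8.onnx", weight_type=QuantType.QInt8)

# Static: a CalibrationDataReader is required; activation scale/zero-point are computed
# during the calibration runs and written into the model as constants.
# quantize_static("model.onnx", "model_static_int8.onnx", calibration_data_reader=my_reader)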

However, I came across some articles/forum answers that seem to point in a different direction. This article says about static quantization:

[...] Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats - and then back to ints - between every operation, resulting in a significant speed-up.

It seems to be arguing that static quantization does not require applying dequantize and then quantize operations to the output of a node before feeding it as input to the next one. I also found a discussion arguing the same:

Q: [...] However, our hardware colleagues told me that because it has FP scales and zero-points in channels, the hardware should still support FP in order to implement it. They also argued that in each internal stage, the values (in-channels) should be dequantized and converted to FP and quantized again for the next layer. [...]

A: For the first argument you are right, since scales and zero-points are FP, hardware need to support FP for the computation. The second argument may not be true, for static quantization the output of the previous layer can be fed into next layer without dequantizing to FP. Maybe they are thinking about dynamic quantization, which keeps tensors between two layers in FP.

And others have answered the same.
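If I understand the argument correctly, the point is that with static quantization the scales and zero-points of adjacent layers are constants known ahead of time, so the dequantize/quantize round trip between two layers collapses into a single fixed rescale that needs no per-inference float statistics. A small numpy sketch of what I mean (the scales and zero-points are made-up values, not taken from a real model):

import numpy as np

# Made-up constant quantization parameters, as if read from a calibrated model.
x_scale, x_zp = 0.1, 10    # output params of the producer layer
y_scale, y_zp = 0.25, 0    # input params of the consumer layer

q_x = np.array([12, 130, 255], dtype=np.uint8)  # quantized output of the producer

# What the QDQ graph shows: DequantizeLinear followed by QuantizeLinear.
fp = (q_x.astype(np.float32) - x_zp) * x_scale
q_y_roundtrip = np.clip(np.round(fp / y_scale) + y_zp, 0, 255).astype(np.uint8)

# The same mapping done as one rescale with the constant factor x_scale / y_scale.
q_y_direct = np.clip(np.round((q_x.astype(np.int32) - x_zp) * (x_scale / y_scale)) + y_zp,
                     0, 255).astype(np.uint8)

print(q_y_roundtrip, q_y_direct)  # both should give the same values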

So I tried to manually quantize a model using onnxruntime.quantization.quantize_static. Before going on, I have to make a disclaimer: I'm not in the AI field, and I'm learning about the topic for another purpose. So I googled how to do it and managed to get it done with the following code:

import torch
import torchvision as tv
import onnxruntime
from onnxruntime import quantization


MODEL_PATH = "best480x640.onnx"
MODEL_OPTIMIZED_PATH = "best480x640_optimized.onnx"
QUANTIZED_MODEL_PATH = "best480x640_quantized.onnx"


# Feeds calibration batches to quantize_static through the CalibrationDataReader interface.
class QuantizationDataReader(quantization.CalibrationDataReader):
    def __init__(self, torch_ds, batch_size, input_name):

        self.torch_dl = torch.utils.data.DataLoader(
            torch_ds, batch_size=batch_size, shuffle=False)

        self.input_name = input_name
        self.datasize = len(self.torch_dl)

        self.enum_data = iter(self.torch_dl)

    def to_numpy(self, pt_tensor):
        return (pt_tensor.detach().cpu().numpy() if pt_tensor.requires_grad
                else pt_tensor.cpu().numpy())

    def get_next(self):
        batch = next(self.enum_data, None)
        if batch is not None:
            return {self.input_name: self.to_numpy(batch[0])}
        else:
            return None

    def rewind(self):
        self.enum_data = iter(self.torch_dl)


preprocess = tv.transforms.Compose([
    tv.transforms.Resize((480, 640)),
    tv.transforms.ToTensor(),
    tv.transforms.Normalize(
        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

ds = tv.datasets.ImageFolder(root="./calib/", transform=preprocess)

# pre-processing (symbolic shape inference + graph optimisation) recommended before quantization
quantization.shape_inference.quant_pre_process(
    MODEL_PATH, MODEL_OPTIMIZED_PATH, skip_symbolic_shape=False)

quant_ops = {"ActivationSymmetric": False, "WeightSymmetric": True}
# session on the original model, used only to read the input tensor name
ort_sess = onnxruntime.InferenceSession(
    MODEL_PATH, providers=["CPUExecutionProvider"])
qdr = QuantizationDataReader(
    ds, batch_size=1, input_name=ort_sess.get_inputs()[0].name)
quantized_model = quantization.quantize_static(
    model_input=MODEL_OPTIMIZED_PATH,
    model_output=QUANTIZED_MODEL_PATH,
    calibration_data_reader=qdr,
    extra_options=quant_ops
)

However, the results confused me even more. The following images show a chunk of the two models' graphs (the "original" one and the quantized one) in Netron. This is the non-quantized model graph:

[Netron graph of the original, non-quantized model]

While this is the quantized one:

[Netron graph of the quantized model]

The fact that it added QuantizeLinear/DequantizeLinear nodes may indicate the answer I'm looking for. However, the way those nodes are placed makes no sense to me: it dequantizes immediately after quantizing, so the inputs of the various Conv, Mul, etc. nodes are still float32 tensors. I'm sure I'm missing (or misunderstanding) something here, so I can't figure out what I was originally looking for: does static quantization allow feeding a node with the still-quantized output of the previous one? And what am I getting wrong in the quantization process above?
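In case it is relevant, my understanding is that ONNX Runtime can dump the graph it actually executes after its own optimisations/fusions, which should show whether those Q/DQ pairs get folded into integer kernels at runtime. This is a sketch of how I would check it (file names are the ones from my script; I'm assuming SessionOptions.optimized_model_filepath does what I think it does):

import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# save the graph as it looks after ONNX Runtime's own optimisations/fusions
so.optimized_model_filepath = "best480x640_quantized_runtime.onnx"

sess = ort.InferenceSession(
    "best480x640_quantized.onnx", sess_options=so,
    providers=["CPUExecutionProvider"])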
