Quantization-aware training in TensorFlow 2.2.0 producing higher inference time


I'm working on quantization for a transfer-learning model built on MobileNetV2 with a personal dataset. There are two approaches I have tried:

i.) Post-training quantization only: it works fine and averages 0.04 s of inference time over 60 images at 224×224 (a typical conversion sketch for this path is shown below).

ii.) Quantization-aware training followed by post-training quantization: it gives better accuracy than post-training quantization alone, but the inference time is much higher, 0.55 s for the same 60 images.
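For context, a post-training-quantized MobileNetV2 model of this kind is typically produced along the lines below. This is only a sketch: `float_model` and `rep_images` are placeholder names, and the exact converter settings used in the question are not shown.

        import numpy as np
        import tensorflow as tf

        def rep_data():
            # hypothetical calibration images, already resized to 224x224
            for img in rep_images:
                yield [np.expand_dims(img, axis=0).astype(np.float32)]

        converter = tf.lite.TFLiteConverter.from_keras_model(float_model)
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        # a representative dataset lets the converter calibrate integer kernels
        converter.representative_dataset = rep_data
        ptq_tflite_model = converter.convert()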

1.) The post-training-quantization-only model (.tflite) is run with:

        import cv2
        import numpy as np
        import tensorflow as tf
        from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

        interpreter = tf.lite.Interpreter(model_path="ptq_only.tflite")  # placeholder path
        interpreter.allocate_tensors()

        images_ = cv2.resize(cv2.cvtColor(cv2.imread(imagepath), cv2.COLOR_BGR2RGB), (224, 224))
        images = preprocess_input(images_)
        x = np.expand_dims(images, axis=0)  # add the batch dimension
        interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
        interpreter.invoke()
        classes = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])

2.) The quantization-aware-training + post-training-quantization model is run with the code below. The difference is that this one asks for float32 input.

        # interpreter here is loaded from the QAT + post-training-quantized .tflite file
        images_ = cv2.resize(cv2.cvtColor(cv2.imread(imagepath), cv2.COLOR_BGR2RGB), (224, 224))
        images = preprocess_input(images_)
        x = np.expand_dims(images, axis=0).astype(np.float32)  # this model expects float32
        interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
        interpreter.invoke()
        classes = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
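A quick way to confirm what each converted model actually expects is to inspect its input details. This is a generic check; the file names below are placeholders, not paths from the question:

        import tensorflow as tf

        for path in ["ptq_only.tflite", "qat_plus_ptq.tflite"]:  # placeholder file names
            interpreter = tf.lite.Interpreter(model_path=path)
            interpreter.allocate_tensors()
            inp = interpreter.get_input_details()[0]
            # 'dtype' shows whether the model wants uint8/int8 or float32;
            # 'quantization' is the (scale, zero_point) pair when the tensor is quantized
            print(path, inp['dtype'], inp['quantization'])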

I have searched a lot but couldn't find an answer to this. Can someone please explain why the inference time is higher with quantization-aware training + post-training quantization compared to post-training quantization alone?


There are 2 answers

Answer from Thaink:

I don't think you should do quantization-aware training + post-training quantization together.

According to https://www.tensorflow.org/model_optimization/guide/quantization/training_example, if you use quantization-aware training, the conversion already gives you a model with int8 weights, so there is no point in doing post-training quantization on top of it.
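Roughly, the flow from that guide looks like the sketch below; `float_model` is a placeholder for the trained Keras model, and the fine-tuning step is elided:

        import tensorflow as tf
        import tensorflow_model_optimization as tfmot

        # insert fake-quant nodes into the trained float model, then fine-tune it
        q_aware_model = tfmot.quantization.keras.quantize_model(float_model)
        # ... compile and fit q_aware_model on the training data ...

        # converting with Optimize.DEFAULT already yields int8 weights,
        # so no separate post-training quantization pass is needed
        converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        quantized_tflite_model = converter.convert()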

Answer from Louis Yang:

I think the part that converts from uint8 to float32 (.astype(np.float32)) is what makes it slower. Otherwise, they should be at the same speed.
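One way to test this claim is to time the cast separately from the interpreter call. A minimal sketch, reusing `images` and `interpreter` from the question's second snippet:

        import time
        import numpy as np

        t0 = time.perf_counter()
        x = np.expand_dims(images, axis=0).astype(np.float32)  # the suspected conversion
        t1 = time.perf_counter()
        interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
        interpreter.invoke()
        t2 = time.perf_counter()
        print(f"cast: {t1 - t0:.4f}s, set_tensor + invoke: {t2 - t1:.4f}s")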