I'm working on quantization in transfer learning with MobileNetV2 on a custom dataset. There are two approaches that I have tried:
i.) Post training quantization only: it works fine and gives an average inference time of 0.04s over 60 images at 224x224 resolution.
ii.) Quantization aware training + post training quantization: it gives higher accuracy than post training quantization alone, but a higher average inference time of 0.55s for the same 60 images.
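For context, the two .tflite files come from roughly the following conversion paths (a minimal sketch using the standard TF/TFMOT APIs; model stands for the fine-tuned Keras MobileNetV2, and train_images/train_labels are placeholders for the QAT fine-tuning data):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# (i) Post training quantization only: convert the trained float model directly.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
ptq_tflite = converter.convert()

# (ii) Quantization aware training first, then the same conversion step.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
qat_model.fit(train_images, train_labels, epochs=1)  # brief fine-tuning with fake-quant nodes
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
qat_tflite = converter.convert()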
1.) The model with post training quantization only (.tflite) can be run with:
import cv2
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Load the image, convert BGR -> RGB, and resize to the 224x224 input size.
images_ = cv2.resize(cv2.cvtColor(cv2.imread(imagepath), cv2.COLOR_BGR2RGB), (224, 224))
images = preprocess_input(images_)
x = np.expand_dims(images, axis=0).astype(interpreter.get_input_details()[0]['dtype'])  # add batch dim, match the dtype this model expects
interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
interpreter.invoke()
classes = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
2.) The quantization aware training + post training quantization model can be run with the code below. The difference is that this one asks for float32 input (the expected dtype can be confirmed as shown after the snippet).
images_ = cv2.resize(cv2.cvtColor(cv2.imread(imagepath), cv2.COLOR_BGR2RGB), (224, 224))
images = preprocess_input(images_)
x = np.expand_dims(images, axis=0).astype(np.float32)  # this model expects float32 input
interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
interpreter.invoke()
classes = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
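The expected input type of each model can be read directly from the interpreter, which is how the float32 requirement above shows up:

print(interpreter.get_input_details()[0]['dtype'])   # e.g. numpy.uint8 or numpy.float32
print(interpreter.get_output_details()[0]['dtype'])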
I have searched a lot but haven't found an answer to this query. If possible, please help me understand why the inference time is higher for quantization aware training + post training quantization compared to post training quantization only.
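For reference, a measurement loop along these lines (warm-up run excluded) gives a like-for-like comparison between the two models; here interpreter is assumed to be already allocated and image_tensors to hold the 60 preprocessed inputs:

import time

input_index = interpreter.get_input_details()[0]['index']
interpreter.set_tensor(input_index, image_tensors[0])
interpreter.invoke()  # warm-up run, excluded from the timing
start = time.perf_counter()
for x in image_tensors:
    interpreter.set_tensor(input_index, x)
    interpreter.invoke()
print('average inference time: %.3fs' % ((time.perf_counter() - start) / len(image_tensors)))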
I don't think you should do quantization aware training + post training quantization together.
According to https://www.tensorflow.org/model_optimization/guide/quantization/training_example, if you use quantization aware training, the conversion already gives you a model with int8 weights, so there is no point in doing post training quantization on top of it.
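You can verify this by listing the tensor types of the QAT-converted model; the weight tensors should already be int8 (a quick sketch, with qat_model.tflite as a placeholder path):

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='qat_model.tflite')
interpreter.allocate_tensors()
for t in interpreter.get_tensor_details():
    print(t['name'], t['dtype'])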