I'm working on a project with TensorFlow 2 and the TF-TRT (TensorRT) module for deep learning acceleration.
I used TF-TRT to quantize a pretrained deep neural network (FP32 -> FP16), and the latency reduction is amazing.
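For reference, this is roughly how I'm doing the conversion (a minimal sketch; the SavedModel directory names are just placeholders):

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Ask TF-TRT to build engines in FP16 precision
conversion_params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP16,
)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='saved_model_fp32',  # placeholder path
    conversion_params=conversion_params,
)
converter.convert()

# Save the converted SavedModel -- this is the artifact whose
# size I'm comparing against the original
converter.save('saved_model_trt_fp16')  # placeholder path
```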
As far as I know, when quantizing from float32 to float16, the model size should be about half of the original.
But there is one problem: the model size didn't shrink; it actually increased. (I have also used TensorFlow Lite to quantize the same network, and that works as expected, cutting the model size in half; see the sketch below.)
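For comparison, this is roughly the TensorFlow Lite float16 post-training quantization I used (again, paths are placeholders), which does halve the file size:

```python
import tensorflow as tf

# Standard TFLite float16 post-training quantization
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_fp32')  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

# Write the flatbuffer; this file comes out at roughly half
# the size of the FP32 model, as expected
with open('model_fp16.tflite', 'wb') as f:
    f.write(tflite_model)
```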
Do you know why the TF-TRT model size increased?
Thanks!