I would like to know how to use cublasGemmEx to run int8-quantized inference on a .pth model trained with PyTorch.
I tried torch.quantization.quantize_dynamic, but it does not seem to work on CUDA. I also tried converting the model to ONNX, but it runs very slowly and throws the warning "Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance."
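For reference, here is a minimal sketch (assuming PyTorch is installed; the toy model is hypothetical, not my actual network) of the dynamic-quantization attempt described above. quantize_dynamic does replace the Linear weights with int8, but the resulting quantized modules only ship CPU kernels, which is why this path does not run on CUDA:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the real .pth network
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Dynamically quantize all Linear layers to int8
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 16)
y = qmodel(x)  # runs fine on CPU; moving qmodel to CUDA fails

# The Linear layers have been swapped for dynamically quantized variants
print(type(qmodel[0]))
```

Calling `qmodel.to("cuda")` and running inference there is what fails, since the quantized linear ops have no CUDA backend.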