I compared the accuracy of wav2vec and Microsoft STT on a few TED talks.
Accuracy is reported as word error rate (WER), so lower is better.
|Video |Wav2Vec|Microsoft STT|
|:---: |:-----:|:---------:|
|[1][1]|8.57 |3.7 |
|[2][2]|13.83 |5.8 |
|[3][3]|20.7 |11.1 |
|[4][4]|12.5 |6.6 |
Microsoft's WER is roughly half of wav2vec's on every file. Isn't wav2vec supposed to be state of the art? What am I missing here?
I used the large 960-hour model provided in fairseq to generate the transcripts.
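
For reference, here is a minimal sketch of how a WER figure like the ones above can be computed. It is not necessarily the exact scoring script behind the table: the function name `word_error_rate` is hypothetical, and the lowercasing and whitespace tokenization are assumptions. Differences in text normalization (casing, punctuation, numerals) can change the score, so both systems should be scored the same way.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Because the denominator is the number of reference words, insertions can push WER above 100.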