I fine-tuned NVIDIA's Tacotron2 on a German dataset. It works reasonably well, but some words are mispronounced.
I have another set of WAV files from the same speaker, with a corresponding metadata.csv.
How do I filter this second set so that it mainly contains sentences that teach the model the pronunciations it currently gets wrong?
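
To make the question concrete, here is a minimal sketch of the kind of filtering I have in mind. It is not a tested solution. It assumes metadata.csv is pipe-delimited with the transcript in the second column (LJSpeech-style; my actual layout may differ), and that I keep a hand-written list of words the model mispronounces. The function name `filter_metadata`, the sample rows, and the word list are all made up for illustration.

```python
import csv
import io
import re

def filter_metadata(lines, target_words, delimiter="|", text_column=1):
    """Keep rows whose transcript contains at least one target word.

    Hypothetical helper: the column layout is an assumption, not
    something defined by Tacotron2 itself.
    """
    if not target_words:
        return []
    # Whole-word, case-insensitive match. In Python 3, \b and \w are
    # Unicode-aware for str patterns, so umlauts and ß are handled.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in target_words) + r")\b",
        re.IGNORECASE,
    )
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [
        row for row in reader
        if len(row) > text_column and pattern.search(row[text_column])
    ]

# Made-up sample rows standing in for the real metadata.csv.
sample = io.StringIO(
    "utt_0001|Die Chance ist gering.\n"
    "utt_0002|Das Wetter ist schön.\n"
    "utt_0003|Der Ingenieur arbeitet im Restaurant.\n"
)
selected = filter_metadata(sample, ["Chance", "Restaurant"])
for row in selected:
    print("|".join(row))
```

This only matches exact word forms, so inflected or compound German words would be missed. Is there a better approach, for example matching on phonemes or graphemes instead of whole words?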