I have a large dataset of call center recordings in the Kazakh language, and I want to build a speaker diarization system. Which pre-trained models would be suitable for fine-tuning and inference? My dataset consists of WAV files and JSON files with timestamps marking when the operator is talking and when the customer is talking.
Any suggestions for speaker diarization models would be appreciated.
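
For context on the annotations: many diarization toolkits and scoring tools read reference labels in the plain-text RTTM format, so the JSON timestamps would probably need converting first. Below is a minimal sketch of that conversion. The JSON layout is an assumption (a list of segments with hypothetical keys `speaker`, `start`, and `end`, in seconds), and the function name `segments_to_rttm` and the file ID `call_0001` are made up for illustration; the real files may be structured differently.

```python
import json


def segments_to_rttm(file_id, segments):
    """Convert labelled segments into RTTM lines.

    `segments` is assumed (hypothetically) to be a list of dicts with
    keys 'speaker', 'start', and 'end', where times are in seconds.
    """
    lines = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        duration = seg["end"] - seg["start"]
        if duration <= 0:
            # Skip empty or inverted segments rather than emit bad rows.
            continue
        # RTTM fields: type, file id, channel, onset, duration,
        # two unused fields, speaker name, two unused fields.
        lines.append(
            f"SPEAKER {file_id} 1 {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return lines


# Hypothetical annotation content for one call.
example = json.loads(
    '[{"speaker": "operator", "start": 0.0, "end": 4.2},'
    ' {"speaker": "customer", "start": 4.5, "end": 9.0}]'
)
rttm_lines = segments_to_rttm("call_0001", example)
print("\n".join(rttm_lines))
```

Writing one such RTTM file per WAV file would give reference labels that can be used both for fine-tuning and for computing diarization error rate during evaluation.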