I'm currently integrating Vosk speech recognition into an application. Looking specifically at speaker recognition, I've implemented test_speaker.py from the examples and it works. Being new to this, how can I identify and/or create the reference speaker signature? Using the one provided, the list of distances calculated for my audio example doesn't distinguish the two speakers involved:
[1.0182311997728735, 0.8679279016022726, 0.8552687907177629, 1.0258941854519696, 0.8666933753723253, 0.9291881495586336, 1.0316585805917928, 1.0227699471036409, 0.8442800102809634, 0.9093189414477789, 0.9153723223264221, 0.9705387223260904, 0.9077720598812595, 0.9524431272217568, 0.9179475137290445]
If there is no effective way to compute a reference speaker from within the audio under analysis, do you know of another solution that can be used with Vosk to identify speakers in an audio file? If not, what other speech-to-text option would you suggest? (I've already tried Google's.)
Thanks in advance
I've been working with Vosk recently as well, and the way to create a new reference speaker is to extract the x-vector output from the recognizer.
This is code from the Python example that I adapted to put each utterance's x-vector into a list called "vectorList".
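A minimal sketch of that kind of adaptation might look like the following. It assumes the `vosk` package is installed, that a speech model and a speaker model have been unpacked into `model` and `model-spk` (both paths are placeholders), and that the input is a mono 16-bit PCM WAV file. The helper names `collect_xvectors` and `xvectors_from_wav` are made up for illustration. Once a speaker model is attached, the recognizer reports the x-vector under the `spk` key of each result.

```python
import json
import wave


def collect_xvectors(result_strings):
    """Gather the "spk" x-vector from each JSON result that has one."""
    vector_list = []
    for raw in result_strings:
        res = json.loads(raw)
        # Short or silent utterances may come back without an x-vector.
        if "spk" in res:
            vector_list.append(res["spk"])
    return vector_list


def xvectors_from_wav(wav_path, model_path="model", spk_model_path="model-spk"):
    """Run Vosk over a WAV file and return one x-vector per utterance."""
    # Imported here so the parsing helper above works without Vosk installed.
    from vosk import KaldiRecognizer, Model, SpkModel

    results = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_path), wf.getframerate())
        rec.SetSpkModel(SpkModel(spk_model_path))
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True when an utterance has ended.
            if rec.AcceptWaveform(data):
                results.append(rec.Result())
        # Flush whatever audio is still buffered at the end of the file.
        results.append(rec.FinalResult())
    return collect_xvectors(results)
```

Calling `xvectors_from_wav("my_audio.wav")` (a placeholder file name) would then give a list with one x-vector per recognized utterance.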
In my program, I then use the vectors in this list as the reference speakers, comparing each of them with other x-vectors in the cosine_dist function. The cosine_dist function returns a "speaker distance" that tells you how different the two x-vectors are: the smaller the distance, the more likely both came from the same speaker.
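For reference, a cosine distance of this kind can be sketched in plain Python as below. The example script computes the same quantity with NumPy; this version uses only the standard library, and `closest_reference` is a hypothetical helper added here to show the comparison against a list of reference vectors.

```python
import math


def cosine_dist(x, y):
    """Return 1 - cosine similarity: 0 for same direction, 2 for opposite."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def closest_reference(xvector, references):
    """Return (index, distance) of the reference x-vector nearest to xvector."""
    distances = [cosine_dist(ref, xvector) for ref in references]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]
```

Distances close to 1, like the ones in the question, mean the utterance and the reference are nearly unrelated, which is what you would expect when the reference signature belongs to someone who is not in the recording.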
In summary, the program I'm developing does the following:
I should mention that I'm no expert with Vosk, and it is entirely possible there is a better way to go about this. This is just the approach I've found, based on the example in the python directory.
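For what it's worth, one way to build references from within the audio itself, when no signature exists up front, is a simple greedy pass over the utterance x-vectors. This is only a sketch under assumptions: `assign_speakers` is a hypothetical name, and the threshold of 0.5 is an arbitrary placeholder that would need tuning on real recordings.

```python
import math


def cosine_dist(x, y):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def assign_speakers(xvectors, threshold=0.5):
    """Greedily label each x-vector with a speaker index.

    The first x-vector becomes speaker 0. Each later x-vector joins the
    nearest known speaker if it is within `threshold`; otherwise it
    starts a new speaker and becomes that speaker's reference.
    """
    references = []
    labels = []
    for vec in xvectors:
        dists = [cosine_dist(ref, vec) for ref in references]
        if dists and min(dists) <= threshold:
            labels.append(dists.index(min(dists)))
        else:
            references.append(vec)
            labels.append(len(references) - 1)
    return labels, references
```

Using the first utterance of each speaker as that speaker's reference is the simplest choice; averaging several x-vectors per speaker would probably be more robust, but I haven't verified that.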