Vosk Speaker Recognition

I'm currently integrating Vosk speech recognition into an application. Looking specifically at speaker recognition, I've implemented test_speaker.py from the examples and it is functional. Being new to this, how can I identify and/or create the reference speaker signature? Using the one provided, the list of distances calculated from my audio sample doesn't distinguish the two speakers involved:

[1.0182311997728735, 0.8679279016022726, 0.8552687907177629, 1.0258941854519696, 0.8666933753723253, 0.9291881495586336, 1.0316585805917928, 1.0227699471036409, 0.8442800102809634, 0.9093189414477789, 0.9153723223264221, 0.9705387223260904, 0.9077720598812595, 0.9524431272217568, 0.9179475137290445]

If there is not an effective way to calculate a reference speaker from within the audio under analysis, do you know of another solution that can be used with Vosk to identify speakers in an audio file? If not, what other speech-to-text option would you suggest? (I've already played with Google's.)

Thanks in advance

1 Answer

Aaron Walker:

I've been working with Vosk recently as well, and the way to create a new reference speaker is to extract the x-vector output from the recognizer.

This is code from the Python example, adapted to put each utterance's x-vector into a list called vectorList:

    # Inside the loop that feeds audio chunks to the recognizer
    if recognizer.AcceptWaveform(data):
        res = json.loads(recognizer.Result())
        # print("Text:", res['text'])
        # The result only contains an x-vector ('spk') when the
        # recognizer was created with a speaker model
        if 'spk' in res:
            # Append this utterance's x-vector to the baseline list
            vectorList.append(res['spk'])

In my program, I then use the vectors in that list as the reference speakers and compare them with other x-vectors using the cosine_dist function. cosine_dist returns a "speaker distance" that tells you how different the two x-vectors are.
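For reference, the cosine_dist helper from the example is essentially the standard cosine distance between the two vectors, computed with numpy; a minimal version looks like this:

    import numpy as np

    def cosine_dist(x, y):
        # Distance near 0 means the vectors point the same way
        # (likely the same speaker); values closer to 1 mean the
        # vectors are unrelated.
        nx = np.array(x)
        ny = np.array(y)
        return 1 - np.dot(nx, ny) / (np.linalg.norm(nx) * np.linalg.norm(ny))

Lower distances mean the two utterances are more likely from the same speaker; you'll need to pick a threshold empirically against your own baseline recordings.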

In summary, the program I'm developing does the following (a rough sketch follows the list):

  • Run some "baseline" audio files through the recognizer to get their x-vectors
  • Store the x-vectors in a list
  • Run some testing audio files through the recognizer to get x-vectors to test with
  • Run each test x-vector against each "baseline" x-vector with the cosine_dist function
  • Average the speaker distances returned from cosine_dist to get the average speaker distance

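Here's a rough sketch of that pipeline, assuming the standard Vosk Python API (Model, SpkModel, KaldiRecognizer); the model directories, file names, and the get_xvectors helper are placeholders you'd replace with your own:

    import json
    import wave
    import numpy as np
    from vosk import Model, SpkModel, KaldiRecognizer

    model = Model("model")              # speech model directory (placeholder path)
    spk_model = SpkModel("model-spk")   # speaker model directory (placeholder path)

    def get_xvectors(path):
        # Run one WAV file through the recognizer and collect the
        # x-vector of every recognized utterance.
        wf = wave.open(path, "rb")
        rec = KaldiRecognizer(model, wf.getframerate(), spk_model)
        vectors = []
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            if rec.AcceptWaveform(data):
                res = json.loads(rec.Result())
                if 'spk' in res:
                    vectors.append(res['spk'])
        res = json.loads(rec.FinalResult())
        if 'spk' in res:
            vectors.append(res['spk'])
        return vectors

    def cosine_dist(x, y):
        # Same cosine distance as in the snippet above.
        nx, ny = np.array(x), np.array(y)
        return 1 - np.dot(nx, ny) / (np.linalg.norm(nx) * np.linalg.norm(ny))

    # Baseline x-vectors from known-speaker audio (placeholder file name)
    baseline = get_xvectors("baseline_speaker.wav")

    # Compare each test x-vector against every baseline x-vector and
    # average the distances; lower averages suggest the same speaker.
    for test_vec in get_xvectors("test_audio.wav"):
        avg = np.mean([cosine_dist(b, test_vec) for b in baseline])
        print("average speaker distance:", avg)
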
I should mention that I'm no expert with Vosk, and it is entirely possible there is a better way to go about this. This is just the way I've found to do it, based on the example in the python directory.