Vosk Speaker Recognition

I'm currently integrating Vosk speech recognition into an application. Looking specifically at speaker recognition, I've implemented test_speaker.py from the examples and it is functional. Being new to this, how can I identify and/or create the reference speaker signature? Using the one provided, the list of distances calculated from my audio sample doesn't distinguish the two speakers involved:

[1.0182311997728735, 0.8679279016022726, 0.8552687907177629, 1.0258941854519696, 0.8666933753723253, 0.9291881495586336, 1.0316585805917928, 1.0227699471036409, 0.8442800102809634, 0.9093189414477789, 0.9153723223264221, 0.9705387223260904, 0.9077720598812595, 0.9524431272217568, 0.9179475137290445]

If there is not an effective way to calculate a reference speaker from within the audio under analysis, do you know of another solution that can be used with Vosk to identify speakers in an audio file? If not, what other speech-to-text option would you suggest? (I've already played with Google's.)

Thanks in advance

1 Answer

Aaron Walker:

I've been working with Vosk recently as well, and the way to create a new reference speaker is to extract the x-vector output from the recognizer.

This is code from the Python example, adapted to put each utterance's x-vector into a list called vectorList:

    # Inside the loop that feeds audio chunks to the recognizer
    if recognizer.AcceptWaveform(data):
        res = json.loads(recognizer.Result())
        # print("Text:", res['text'])
        # The result only contains an x-vector ('spk') when the
        # recognizer was created with a speaker model
        if 'spk' in res:
            # Append this utterance's x-vector to the baseline list
            vectorList.append(res['spk'])

In my program, I then use the vectors in that list as the reference speakers and compare them with other x-vectors using the cosine_dist function. cosine_dist returns a "speaker distance" that tells you how different the two x-vectors are.
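For reference, the cosine_dist helper from the example is essentially the standard cosine distance between the two vectors, computed with numpy; a minimal version looks like this:

    import numpy as np

    def cosine_dist(x, y):
        # Distance near 0 means the vectors point the same way
        # (likely the same speaker); values closer to 1 mean the
        # vectors are unrelated.
        nx = np.array(x)
        ny = np.array(y)
        return 1 - np.dot(nx, ny) / (np.linalg.norm(nx) * np.linalg.norm(ny))

Lower distances mean the two utterances are more likely from the same speaker; you'll need to pick a threshold empirically against your own baseline recordings.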

In summary, the program I'm developing does the following (a rough sketch follows the list):

  • Run some "baseline" audio files through the recognizer to get their x-vectors
  • Store the x-vectors in a list
  • Run some testing audio files through the recognizer to get x-vectors to test with
  • Run each test x-vector against each "baseline" x-vector with the cosine_dist function
  • Average the speaker distances returned from cosine_dist to get the average speaker distance

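Here's a rough sketch of that pipeline, assuming the standard Vosk Python API (Model, SpkModel, KaldiRecognizer); the model directories, file names, and the get_xvectors helper are placeholders you'd replace with your own:

    import json
    import wave
    import numpy as np
    from vosk import Model, SpkModel, KaldiRecognizer

    model = Model("model")              # speech model directory (placeholder path)
    spk_model = SpkModel("model-spk")   # speaker model directory (placeholder path)

    def get_xvectors(path):
        # Run one WAV file through the recognizer and collect the
        # x-vector of every recognized utterance.
        wf = wave.open(path, "rb")
        rec = KaldiRecognizer(model, wf.getframerate(), spk_model)
        vectors = []
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            if rec.AcceptWaveform(data):
                res = json.loads(rec.Result())
                if 'spk' in res:
                    vectors.append(res['spk'])
        res = json.loads(rec.FinalResult())
        if 'spk' in res:
            vectors.append(res['spk'])
        return vectors

    def cosine_dist(x, y):
        # Same cosine distance as in the snippet above.
        nx, ny = np.array(x), np.array(y)
        return 1 - np.dot(nx, ny) / (np.linalg.norm(nx) * np.linalg.norm(ny))

    # Baseline x-vectors from known-speaker audio (placeholder file name)
    baseline = get_xvectors("baseline_speaker.wav")

    # Compare each test x-vector against every baseline x-vector and
    # average the distances; lower averages suggest the same speaker.
    for test_vec in get_xvectors("test_audio.wav"):
        avg = np.mean([cosine_dist(b, test_vec) for b in baseline])
        print("average speaker distance:", avg)
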
I should mention that I'm no expert with Vosk, and it is entirely possible there is a better way to go about this. This is just the way I've found to do it, based on the example in the python directory.