How to find a word spoken in a speech in Python


I am working on detecting trigger words like "Hi Siri", "Ok Google" etc. in Python. My approach is to record the trigger word and some output speech in wav files, then read them and extract features using pyAudioAnalysis. Lastly, I compare the cosine similarity of the trigger word's features with the features extracted for a sliding window over the output. The issue is that the similarity stays low even where the exact same words were spoken. The code is given below:

import numpy as np
from scipy import spatial

def match_transcription(tar, out):
    "returns the mean cosine similarity for each sliding-window position over the output"
    print(tar.shape)  # trigger word's features (34, num of frames)
    print(out.shape)  # output sound's features (34, num of frames)
    sims = []  # will hold the similarities for all features, for all window positions
    for i in range(tar.shape[0]):  # loop over all 34 features
        chunk_tar = tar[i]  # pick one feature row from the trigger word
        chunk_out = out[i]  # pick the same feature row from the output
        sims1 = []
        chunk_outs = window(chunk_out, tar.shape[1])  # generate sliding windows over the output feature row

        for chunk in chunk_outs:  # loop over all window positions
            sim = 1 - spatial.distance.cosine(chunk, chunk_tar)  # cosine similarity between trigger and window
            sims1.append(sim)  # add similarity to the list
        sims.append(np.array(sims1))
    sims = np.array(sims)
    means = np.mean(sims, axis=0)  # average the similarity over all 34 features
    print(sims)
    print('Mean Similarity: ', means)
    return means
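
The window helper used above is not shown in the question; a minimal sketch of what it could look like, assuming it yields consecutive slices of the output feature row that are as long as the trigger word's feature row, stepping one frame at a time:

def window(seq, size, step=1):
    "yields consecutive slices of length `size`, moving `step` frames forward each time"
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]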

The output looks like this:

Mean Similarity:  [0.25522565 0.25120983 0.25925772 0.27925796 0.28873657 0.289228
 0.3081794  0.3477496  0.33269364 0.34055122 0.34868945 0.33925324
 0.34162649 0.32976345 0.32332807 0.33668049 0.34458411 0.36058285
 0.37208687 0.37574359 0.400042   0.40289759 0.3872925  0.35079805
 0.36320806 0.36803756 0.35871608 0.35921478 0.36508046 0.39065785
 0.40899824 0.43283008 0.43767465 0.42003872 0.41108351 0.41531505
 0.39725584 0.38569253 0.35555717 0.36983754 0.37081652 0.39188315]

The output shows that every sliding window of the output has very little similarity with the trigger word, even where the same words were spoken.

The feature extraction is the same for the trigger words and the output sounds:

from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis.audioFeatureExtraction import stFeatureExtraction

def get_features(f_name):
    "returns short-term features (34 x num of frames) from the audio file"
    [Fs, x] = audioBasicIO.readAudioFile(f_name)                 # sampling rate and signal
    F, f_names = stFeatureExtraction(x, Fs, 0.050*Fs, 0.025*Fs)  # 50 ms window, 25 ms step
    return F, f_names

F1, f1_names = get_features('trigger_word.wav')  # done for all output sounds as well
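
For completeness, the output sound is run through the same function and both feature matrices are passed to match_transcription; a short sketch, assuming the recording to be searched is saved as output_sound.wav (the file name is just a placeholder):

F2, f2_names = get_features('output_sound.wav')  # features of the recording to be searched
means = match_transcription(F1, F2)              # mean similarity for every window position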

My question is: which of the 34 features are relevant for checking the similarity between trigger words and output sounds? Or is there another way to perform the same job in Python? Thanks!
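
As an illustration of restricting the comparison to a subset of the 34 features, the rows could be filtered by their names before calling match_transcription; a rough sketch, assuming the MFCC features are the ones whose names start with "mfcc" (select_rows is a hypothetical helper, not part of pyAudioAnalysis):

def select_rows(F, f_names, prefix='mfcc'):
    "keeps only the feature rows whose name starts with the given prefix"
    idx = [i for i, name in enumerate(f_names) if name.startswith(prefix)]
    return F[idx]

means = match_transcription(select_rows(F1, f1_names), select_rows(F2, f2_names))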
