I am working on detecting trigger words like "Hi Siri", "Ok Google", etc. in Python. My approach is to record the trigger word and some output words in wav files, then read them and extract features using pyAudioAnalysis. Lastly, I compute the cosine similarity between the trigger word's features and the features extracted for each sliding window over the output. The issue is that, for the exact same words, the similarity comes out very low. The code is given below:
import numpy as np
from scipy import spatial

def match_transcription(tar, out):
    "returns list of correlations where both audio transcriptions match"
    print(tar.shape)  # trigger word's features (34, num of frames)
    print(out.shape)  # output sound's features (34, num of frames)
    sims = []  # will have similarities for all features, for all chunks
    for i in range(tar.shape[0]):  # loop over all features
        chunk_tar = tar[i]  # pick one feature from target
        chunk_out = out[i]  # pick same feature from output
        sims1 = []
        chunk_outs = window(chunk_out, tar.shape[1])  # generate sliding windows over the output feature (window is a helper not shown here)
        for chunk in chunk_outs:  # loop over all output windows
            sim = 1 - spatial.distance.cosine(chunk, chunk_tar)  # calculate cosine similarity between target and output features
            sims1.append(sim)  # add similarities to list
        sims.append(np.array(sims1))
    sims = np.array(sims)
    means = np.mean(sims, axis=0)  # take mean over all features for each window
    print(sims)
    print(means)
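The `window` helper is not shown in the code above. A minimal sketch of what such a helper might look like, assuming it yields consecutive slices of a fixed length and advances one frame at a time (the `step` parameter is a hypothetical addition for illustration):

```python
import numpy as np

def window(seq, size, step=1):
    """Yield consecutive slices of `seq` with length `size`, advancing by `step` (assumed behavior)."""
    seq = np.asarray(seq)
    # The last valid start index is len(seq) - size, so every slice is full length
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]

# Example: one feature track with 6 frames, target length of 4 frames
chunks = list(window(np.arange(6), 4))
```

With a step of 1, an output of N frames and a target of M frames gives N - M + 1 windows, which is the length of the `means` array.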
The output looks like this:
Mean Similarity: [0.25522565 0.25120983 0.25925772 0.27925796 0.28873657 0.289228
0.3081794 0.3477496 0.33269364 0.34055122 0.34868945 0.33925324
0.34162649 0.32976345 0.32332807 0.33668049 0.34458411 0.36058285
0.37208687 0.37574359 0.400042 0.40289759 0.3872925 0.35079805
0.36320806 0.36803756 0.35871608 0.35921478 0.36508046 0.39065785
0.40899824 0.43283008 0.43767465 0.42003872 0.41108351 0.41531505
0.39725584 0.38569253 0.35555717 0.36983754 0.37081652 0.39188315]
The output shows that every sliding window of the output has very little similarity with the trigger word, even though the same words were spoken.
The feature extraction is the same for the trigger words and the output sounds:
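from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis.audioFeatureExtraction import stFeatureExtraction

def get_features(f_name):
    "returns short term features from the audio"
    [Fs, x] = audioBasicIO.readAudioFile(f_name)
    F, f_names = stFeatureExtraction(x, Fs, 0.050*Fs, 0.025*Fs)  # 50 ms window, 25 ms step
    return F, f_names

F1, f1_names = get_features('trigger_word.wav')  # done for all output sounds as well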
My question is: which of the 34 features are relevant for checking the similarity between trigger words and output sounds? Or is there any other way to do the same job in Python? Thanks!