My daughter and I are building a robot horse. One design goal is to use speech recognition to recognize commands given to the horse and respond accordingly. Since most of the commands are barely English words, I need something that lets me define custom words. I've got some experience in my day job with Kaldi-ASR, so I figured I would investigate its capabilities first.
The recognition grammar would consist of a few commands: Walk ("walk" or two kissing sounds), Trot ("TT-ro-TT"), Gallop ("gee-yup"), Stop ("whoa"), and a "go faster" command ("come on" or clucking the tongue), plus the horse's name, a few phrases like "good boy," and a few standalone sounds like clucking the tongue.
The hardware it runs on will be limited, probably a Raspberry Pi 4. (But I could be talked into something beefier if there were a significant speed benefit for this type of recognition.)
The first challenge is that horse commands are given with a lot of different emphases, cadences, and accents, even from the same person. E.g., "giddy-up" could be pronounced like "Giddy Up," "GEE-up," "EE-YUP," "gee-UP," etc.
The second is that some horse commands aren't words at all: clucking the tongue and kissing sounds are two major ones.
First question: Is Kaldi going to be a good fit for this? (I use it, but know little about the theory behind it.) Does it handle numerous pronunciations well? Can it work for non-word utterances like clucking the tongue or making kissing sounds? If not, is there a better recognition engine for this type of task?
Second question: How do I handle the various pronunciations? Should I treat them as different words and train them separately, or will Kaldi handle it if I give it lots of sample data to train on? In other words, would splitting the pronunciations into separate words give better recognition, or would a single word trained with lots of variation in its training audio?
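To make that second question concrete, here's roughly what I'm picturing for the pronunciation dictionary (Kaldi's lexicon.txt), where repeating a word on multiple lines gives it alternate pronunciations. The phone symbols are just a sketch in CMUdict style, and `CLK`/`KSS` are made-up pseudo-phones I invented for the non-word sounds; I have no idea if that's the right way to model them:

```
giddyup  G IH D IY AH P
giddyup  JH IY AH P
giddyup  IY AH P
whoa     W OW
walk     W AO K
walk     KSS KSS
faster   K AH M AO N
faster   CLK
```

Is this multi-pronunciation-per-word approach the right direction, or would I get better results making each variant its own "word" and mapping them back to one command in my application code?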
Any additional hints on how best to train for these types of sounds would be appreciated as well.