I'm working on an app for people stuck in superfluous meetings who need to know when someone asks them a question.
My plan is:
- Stream the audio of the meeting (what normally comes out of my speakers) into a speech-to-text program
- Stream that into something that watches for my name and/or rising intonation for questions
- Have the program "ding" when someone asks me a question. Then I can quickly read the text and answer.
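Steps (2) and (3) are the part I can already sketch myself. A minimal Python illustration, assuming a transcript arrives as lines of text (the name "Alex" and the question heuristic are placeholders; real rising-intonation detection would need the audio, not just the text):

```python
import re

def is_question_for_me(line: str, my_name: str = "Alex") -> bool:
    """Crude heuristic: flag a transcript line that mentions my name
    or reads like a question. Both rules are placeholders."""
    mentions_me = my_name.lower() in line.lower()
    looks_like_question = line.rstrip().endswith("?") or bool(
        re.match(
            r"(?i)^(what|who|where|when|why|how|can|could|do|does|did|is|are|will|would)\b",
            line.strip(),
        )
    )
    return mentions_me or looks_like_question

def watch(transcript_lines, my_name: str = "Alex"):
    """Consume transcript lines and 'ding' (terminal bell) on a hit."""
    for line in transcript_lines:
        if is_question_for_me(line, my_name):
            print("\aDING:", line)  # bell character plus the text to read

# example run
watch([
    "Let's move on to the roadmap.",
    "Alex, what do you think about the timeline?",
])
```

So it's really only the audio capture in step (1) that I'm stuck on.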
The hard part is step (1). All the speech-to-text programs I found accept audio files as input and cannot tap the live audio that goes to the speakers/headphones. Assistive programs I found, on the other hand, take over keyboard input. Ideally, users should be able to do productive work by typing in other apps during the meeting, so that kind of solution won't work.
So I'm looking for something I can use on OS X that will either handle step (1) or even better do most of the steps above for me.
I've done research into solutions and can't find anything for step (1). I'm including the other steps because there may be a more creative solution for the overall program (such as some other assistive technology not for dictation) that I don't know about.
There are several APIs you can use, for example Google's streaming speech recognition API, though it is not entirely free.
If you can tolerate lower accuracy, you can use open-source software such as CMUSphinx.
The other problem is getting the audio stream out of the VoIP software. You either have to hack the application itself, or re-record what is played on the speakers, which is not always reliable. On OS X, the usual workaround for the latter is a virtual loopback audio device such as Soundflower: set it as the system output device, then record from it as an input.