Just doing some general research. Are there any open source (or even paid?) tools / programs that do the following:
INPUT: an audio file of some unlabeled speech, maybe a few sentences long (no indication of what the phonetic transcription of the audio is)
OUTPUT: an audio file with phonetic transcriptions (in the IPA alphabet) aligned and labeled on the audio
Is this possible to be done with just a phonetic dictionary and without a word dictionary?
Sphinx has an allphone mode that will produce this kind of output hypothesis. But most speech recognition is strongly improved by using a phonetic dictionary and an n-gram language model. It's possible to use those in creating the hypothesis and then convert it into labeled, aligned phonemes with Sphinx.
Here is an example for just phonetic stuff.
http://cmusphinx.sourceforge.net/wiki/phonemerecognition
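To give a flavor of what that looks like in practice, here's a sketch of a pocketsphinx allphone invocation along the lines of that tutorial. The model filenames (`en-us`, `en-us-phone.lm.bin`) and the input file `speech.wav` are placeholders; check the paths in your own pocketsphinx install, and note the output phones are ARPAbet, so you'd still need an ARPAbet-to-IPA mapping step for IPA labels.

```shell
# Decode phonemes directly, with no word dictionary:
# -allphone points at a phonetic (phone-level) language model
# instead of a word LM, and -backtrace prints per-phone segments
# with start/end frames you can turn into time-aligned labels.
pocketsphinx_continuous \
    -infile speech.wav \
    -hmm en-us \
    -allphone en-us-phone.lm.bin \
    -backtrace yes
```

The frame indices in the backtrace are at 100 frames per second by default, so frame numbers divide by 100 to give times in seconds for the alignment.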
But I have been out of the speech rec game for a long time. I believe most people are pursuing neural nets now for these kinds of tasks, and I don't know of any open-source neural nets in that space.