I am learning how to use Sphinx4 using the Maven plug-in for Eclipse.
I took the transcribe demo found on GitHub and altered it to process a file of my own. The audio file is 16-bit, mono, 16 kHz, and approximately 13 seconds long. When I play it back, it sounds like it is in slow motion.
The words spoken in the file are, "also make sure it's easy for you to access the recording files so you could upload it if asked".
I am attempting to transcribe the file and my results are horrendous. My attempts at finding forum posts or links that thoroughly explain how to improve the results, or what I am doing incorrectly, have led me nowhere.
I am looking to improve the accuracy of the transcription, but would like to avoid training a model myself because of the variance in the type of data that my current project will have to deal with. Is this possible, and is there something wrong with the code I am using?
CODE
(NOTE: Audio file available at https://instaud.io/8qv)
import java.io.File;
import java.io.FileInputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;
import edu.cmu.sphinx.decoder.adaptation.Stats;
import edu.cmu.sphinx.decoder.adaptation.Transform;
import edu.cmu.sphinx.result.WordResult;

public class App {
    public static void main(String[] args) throws Exception {
        System.out.println("Loading models...");
        Configuration configuration = new Configuration();
        // Load model from the jar
        configuration
                .setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        // You can also load model from folder
        // configuration.setAcousticModelPath("file:en-us");
        configuration
                .setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration
                .setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.dmp");
        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(
                configuration);
        FileInputStream stream = new FileInputStream(new File("/home/tmscanlan/workspace/example/vocaroo_test_revised.wav"));
        // stream.skip(44); I commented this out due to the short length of my file
        // Simple recognition with generic model
        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            // I added the following print statements to get more information
            System.out.println("\ngetWords() before loop: " + result.getWords());
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
            System.out.print("\nThe getResult(): " + result.getResult()
                    + "\nThe getLattice(): " + result.getLattice());
            System.out.println("List of recognized words and their times:");
            for (WordResult r : result.getWords()) {
                System.out.println(r);
            }
            System.out.println("Best 3 hypothesis:");
            for (String s : result.getNbest(3))
                System.out.println(s);
        }
        recognizer.stopRecognition();
        // Live adaptation to speaker with speaker profiles
        stream = new FileInputStream(new File("/home/tmscanlan/workspace/example/warren_test_smaller.wav"));
        // stream.skip(44); I commented this out due to the short length of my file
        // Stats class is used to collect speaker-specific data
        Stats stats = recognizer.createStats(1);
        recognizer.startRecognition(stream);
        while ((result = recognizer.getResult()) != null) {
            stats.collect(result);
        }
        recognizer.stopRecognition();
        // Transform represents the speech profile
        Transform transform = stats.createTransform();
        recognizer.setTransform(transform);
        // Decode again with updated transform
        stream = new FileInputStream(new File("/home/tmscanlan/workspace/example/warren_test_smaller.wav"));
        // stream.skip(44); I commented this out due to the short length of my file
        recognizer.startRecognition(stream);
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
        System.out.println("...Printing is done..");
    }
}
Here is the output (a photo album I took): https://i.stack.imgur.com/RWkLy.jpg
As Nikolay says, the audio sounds odd, probably because you haven't resampled it in the right way. To downsample the audio from the original 22050 Hz to the desired 16kHz, you can run the following command:
The resulting Vocaroo16.wav will sound much better and will (probably) give you better ASR results.
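To make it clearer what "resampling" means here, below is a minimal Java sketch of downsampling 16-bit mono PCM samples from 22050 Hz to 16000 Hz by linear interpolation. The class name `ResampleSketch` and the method `resample` are hypothetical names chosen for illustration, and the sketch assumes the samples have already been decoded into a `short[]`. It applies no low-pass filter before interpolating, so a dedicated audio tool will give cleaner output; treat it as an illustration of the idea rather than a replacement for one.

```java
public class ResampleSketch {

    /**
     * Naive linear-interpolation resampler for 16-bit mono PCM.
     * Hypothetical helper for illustration only: it does no anti-alias
     * filtering, which a proper resampler would do before downsampling.
     */
    static short[] resample(short[] src, int srcRate, int dstRate) {
        if (src.length == 0) {
            return new short[0];
        }
        // Number of output samples that cover the same duration.
        int outLen = (int) ((long) src.length * dstRate / srcRate);
        short[] out = new short[outLen];
        // How far to advance in the source for each output sample.
        double step = (double) srcRate / dstRate;
        for (int i = 0; i < outLen; i++) {
            double pos = i * step;
            int i0 = (int) pos;
            int i1 = Math.min(i0 + 1, src.length - 1);
            double frac = pos - i0;
            // Blend the two neighbouring source samples.
            out[i] = (short) Math.round(src[i0] * (1 - frac) + src[i1] * frac);
        }
        return out;
    }

    public static void main(String[] args) {
        // One second of silence at 22050 Hz becomes one second at 16000 Hz.
        short[] oneSecond = new short[22050];
        short[] converted = resample(oneSecond, 22050, 16000);
        System.out.println(converted.length); // prints 16000
    }
}
```

The point is that the number of samples changes while the duration stays the same. If only the sample rate in the WAV header is changed and the samples are left untouched, 22050 Hz audio played at 16 kHz runs at roughly 73% speed, which would match the "slow motion" effect described in the question.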