I'm trying to feed audio from an online communication app into the Vosk speech recognition API.
The audio arrives as a byte array in the following format: PCM_SIGNED 48000.0 Hz, 16 bit, stereo, 4 bytes/frame, big-endian.
For Vosk to be able to process it, it needs to be mono and little-endian.
This is my current attempt:
byte[] audioData = userAudio.getAudioData(1);
short[] convertedAudio = new short[audioData.length / 2];
ByteBuffer buffer = ByteBuffer.allocate(convertedAudio.length * Short.BYTES);

// Convert to mono, I don't think I did it right though
int j = 0;
for (int i = 0; i < audioData.length; i += 2)
    convertedAudio[j++] = (short) (audioData[i] << 8 | audioData[i + 1] & 0xFF);

// Convert to little endian
buffer.order(ByteOrder.BIG_ENDIAN);
for (short s : convertedAudio)
    buffer.putShort(s);
buffer.order(ByteOrder.LITTLE_ENDIAN);
buffer.rewind();
for (int i = 0; i < convertedAudio.length; i++)
    convertedAudio[i] = buffer.getShort();

queue.add(convertedAudio);
I had the same problem and found this Stack Overflow post, which converts the raw PCM byte array into an audio input stream.
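For reference, wrapping a raw PCM byte array in an audio input stream can be sketched roughly like this. This is only a minimal sketch using the standard javax.sound.sampled classes, not the code from that post; the class name PcmStreams, the constant SOURCE_FORMAT and the method wrap are hypothetical names, and the format values are taken from the question.

```java
import java.io.ByteArrayInputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;

final class PcmStreams {
    // Format described in the question: 48 kHz, 16 bit, stereo, signed, big-endian.
    static final AudioFormat SOURCE_FORMAT = new AudioFormat(48000f, 16, 2, true, true);

    /** Wraps raw PCM bytes in an AudioInputStream without copying or converting them. */
    static AudioInputStream wrap(byte[] pcm, AudioFormat format) {
        // The stream length is given in sample frames, not bytes.
        long frames = pcm.length / format.getFrameSize();
        return new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
    }

    public static void main(String[] args) {
        // Two stereo frames of silence (4 bytes per frame).
        AudioInputStream stream = wrap(new byte[8], SOURCE_FORMAT);
        System.out.println(stream.getFrameLength()); // prints 2
    }
}
```

Wrapping the bytes this way only describes the data; it does not change the channel count or byte order by itself.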
I assume you're using the Java Discord API (JDA), so here's my initial code for the 'handleUserAudio()' function, which uses Vosk together with the code from the post linked above:
This works so far. However, everything ends up in the recognizer's '.getPartialResult()' method, but at least Vosk understands the audio coming from the Discord bot.
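If you would rather do the conversion by hand, here is a minimal sketch of the stereo-to-mono and big-endian-to-little-endian step, assuming the input really is interleaved 16-bit stereo as described in the question. The class name PcmDownmix and the method name are hypothetical. As far as I can tell, the loop in the question only reassembles the samples and keeps both channels interleaved, so nothing is actually mixed down; here each left/right pair is averaged into one sample.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class PcmDownmix {
    /** Converts 16-bit big-endian stereo PCM into 16-bit little-endian mono PCM. */
    static byte[] stereoBigEndianToMonoLittleEndian(byte[] stereo) {
        final int frameSize = 4; // 2 channels * 2 bytes per sample
        int frames = stereo.length / frameSize; // a trailing partial frame is ignored

        // Read samples as big-endian, write them back as little-endian.
        ByteBuffer in = ByteBuffer.wrap(stereo).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer out = ByteBuffer.allocate(frames * Short.BYTES).order(ByteOrder.LITTLE_ENDIAN);

        for (int i = 0; i < frames; i++) {
            int left = in.getShort();
            int right = in.getShort();
            // Average the two channels; the sum is computed as int, so it cannot overflow.
            out.putShort((short) ((left + right) / 2));
        }
        return out.array();
    }

    public static void main(String[] args) {
        // Frame 1: left = 100, right = 200. Frame 2: left = right = -100.
        byte[] stereo = {0x00, 0x64, 0x00, (byte) 0xC8, (byte) 0xFF, (byte) 0x9C, (byte) 0xFF, (byte) 0x9C};
        byte[] mono = stereoBigEndianToMonoLittleEndian(stereo);
        System.out.println(Arrays.toString(mono)); // prints [-106, 0, -100, -1]
    }
}
```

The output is still sampled at 48000 Hz, so the recognizer would need to be set up for that rate or the audio resampled separately. Note also that the byte order only matters when you hand Vosk a byte array; if you pass samples as a short[] instead, they should hold the plain numeric values and must not be byte-swapped.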