How to turn a PCM byte array into little-endian and mono?


I'm trying to feed audio from an online communication app into the Vosk speech recognition API.

The audio arrives as a byte array in this audio format: PCM_SIGNED 48000.0 Hz, 16 bit, stereo, 4 bytes/frame, big-endian. For Vosk to process it, it needs to be mono and little-endian.
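Both formats can be written out with `javax.sound.sampled`, and `AudioSystem` can report whether its installed providers will convert between them directly (`FormatCheck` is just an illustrative name; whether this particular conversion is supported varies by JVM):

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

public class FormatCheck {
    public static void main(String[] args) {
        // Source: what the communication app delivers (per the question)
        AudioFormat source = new AudioFormat(48000f, 16, 2, true, true);  // signed, stereo, big-endian
        // Target: same rate and bit depth, but mono and little-endian
        AudioFormat target = new AudioFormat(48000f, 16, 1, true, false); // signed, mono, little-endian
        System.out.println("Conversion supported: "
                + AudioSystem.isConversionSupported(target, source));
    }
}
```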

This is my current attempt:

byte[] audioData = userAudio.getAudioData(1);
short[] convertedAudio = new short[audioData.length / 2];
ByteBuffer buffer = ByteBuffer.allocate(convertedAudio.length * Short.BYTES);

// Convert to mono, I don't think I did it right though
int j = 0;
for (int i = 0; i < audioData.length; i += 2)
    convertedAudio[j++] = (short) (audioData[i] << 8 | audioData[i + 1] & 0xFF);

// Convert to little endian
buffer.order(ByteOrder.BIG_ENDIAN);
for (short s : convertedAudio)
    buffer.putShort(s);
buffer.order(ByteOrder.LITTLE_ENDIAN);
buffer.rewind();

for (int i = 0; i < convertedAudio.length; i++)
    convertedAudio[i] = buffer.getShort();

queue.add(convertedAudio);

There are 2 answers

Aaron Walker On BEST ANSWER

I had this same problem and found a Stack Overflow post that converts the raw PCM byte array into an AudioInputStream.

I assume you're using the Java Discord API (JDA), so here's the initial code I have for the handleUserAudio() function; it uses Vosk together with the approach from the post linked above:

// Define the audio format that Vosk uses
AudioFormat target = new AudioFormat(16000, 16, 1, true, false);

try {
    byte[] data = userAudio.getAudioData(1.0f);
    // Create an audio stream that uses the target format and the
    // byte array input stream from Discord
    AudioInputStream inputStream = AudioSystem.getAudioInputStream(target,
            new AudioInputStream(
                    new ByteArrayInputStream(data), AudioReceiveHandler.OUTPUT_FORMAT, data.length));

    // This is what was used before
    // InputStream inputStream = new ByteArrayInputStream(data);

    int nbytes;
    byte[] b = new byte[4096];
    while ((nbytes = inputStream.read(b)) >= 0) {
        if (recognizer.acceptWaveForm(b, nbytes)) {
            System.out.println(recognizer.getResult());
        } else {
            System.out.println(recognizer.getPartialResult());
        }
    }
    // queue.add(data);
} catch (Exception e) {
    e.printStackTrace();
}

This works so far; however, it sends everything through the recognizer's getPartialResult() method rather than getResult(), but at least Vosk is understanding the audio coming from the Discord bot.

Phil Freihofner On

Signed PCM is certainly supported. The problem is that 48000 fps is not. I think the highest frame rate supported by Java directly is 44100.

As to what course of action to take, I'm not sure what to recommend. Maybe there are libraries that can be employed? It is certainly possible to do the conversions manually with the byte data directly, where you enforce the expected data formats.

I can write a bit more about the conversion process itself (assembling bytes into PCM, manipulating the PCM, creating bytes from PCM), if requested. Is Vosk expecting 48000 fps as well?


Going from stereo to mono is a matter of literally taking the sum of the left and right PCM values. It is common to add a step that ensures the range is not exceeded (-1 to 1 if the PCM is coded as normalized floats, -32768 to 32767 if it is coded as 16-bit shorts).
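That sum-and-clamp step, for PCM held in shorts, might look like this (`mixMono` is a hypothetical helper name; summing in an `int` avoids overflow before the clamp):

```java
public class MonoMix {
    // Sum left and right 16-bit samples, clamping to the short range
    static short mixMono(short left, short right) {
        int sum = left + right;          // int arithmetic: no overflow
        if (sum > Short.MAX_VALUE) sum = Short.MAX_VALUE;
        if (sum < Short.MIN_VALUE) sum = Short.MIN_VALUE;
        return (short) sum;
    }

    public static void main(String[] args) {
        System.out.println(mixMono((short) 1000, (short) 2000));   // 3000
        System.out.println(mixMono((short) 30000, (short) 30000)); // clamped to 32767
    }
}
```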

The following code fragment is an example of taking a single PCM value (signed float, normalized to range between -1 and 1) and generating two bytes (16 bits) in little-endian order. The array buffer is of type float and holds the PCM values. The array audioBytes is of type byte.

buffer[i] *= 32767;

audioBytes[i*2] = (byte) buffer[i];
audioBytes[i*2 + 1] = (byte) ((int) buffer[i] >> 8);

To make it big endian, just swap the indexes of audioBytes, or the operations (byte) buffer[i] and (byte)((int)buffer[i] >> 8 ). This code is from the class AudioCue, a class that I wrote that functions as an enhanced Clip. See lines 1391-1394.

I think you can extrapolate the reverse process (converting incoming bytes to PCM). But here is an example of doing this, from the code lines 391-393. In this case temp is a float array that will hold the PCM values that are calculated from the byte stream. In my code, the value will soon be divided by 32767f to make it normalized. (line 400)

temp[clipIdx++] = ( buffer[bufferIdx++] & 0xff ) | ( buffer[bufferIdx++] << 8 ) ;

For big endian, you would reverse the order of & 0xff and << 8.
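Put concretely, the two byte orders differ only in which byte carries the low bits (`fromLE`/`fromBE` are illustrative names, not from the code linked above):

```java
public class Endian {
    // Little-endian: low byte arrives first
    static short fromLE(byte lo, byte hi) {
        return (short) ((lo & 0xFF) | (hi << 8));
    }

    // Big-endian: high byte arrives first
    static short fromBE(byte hi, byte lo) {
        return (short) ((hi << 8) | (lo & 0xFF));
    }

    public static void main(String[] args) {
        // Both reassemble the same sample value 0x1234 (4660)
        System.out.println(fromLE((byte) 0x34, (byte) 0x12)); // 4660
        System.out.println(fromBE((byte) 0x12, (byte) 0x34)); // 4660
    }
}
```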

How you iterate through the structures is up to your personal preference. IDK that I've picked the optimal methods here. For your situation, I'd be tempted to hold the PCM value in a short (ranging from -32768 to 32767) instead of normalizing to -1 to 1 floats. Normalizing makes more sense if you are engaged in processing audio data from multiple sources. But the only processing you are going to do is add the left and right PCM together to get your mono value. It's good, by the way, after summing left and right, to ensure the numerical range isn't exceeded, as exceeding it can create some pretty harsh distortion.
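Putting the pieces together, here is a sketch of the whole manual pipeline for the format in the question: big-endian stereo bytes in, summed-and-clamped little-endian mono bytes out. This assumes 16-bit signed PCM with frames laid out left-then-right, and `convert` is just an illustrative name; it does not resample, so it only helps if Vosk accepts 48000 fps.

```java
public class StereoBEToMonoLE {
    // Convert 16-bit signed big-endian stereo bytes (L, R per frame)
    // into 16-bit signed little-endian mono bytes, summing with clamp.
    static byte[] convert(byte[] in) {
        byte[] out = new byte[in.length / 2]; // 4 bytes/frame in, 2 bytes/frame out
        for (int i = 0, o = 0; i + 3 < in.length; i += 4) {
            // Assemble big-endian shorts: high byte first
            int left  = (short) ((in[i]     << 8) | (in[i + 1] & 0xFF));
            int right = (short) ((in[i + 2] << 8) | (in[i + 3] & 0xFF));
            // Sum to mono, clamping to the 16-bit range
            int sum = left + right;
            if (sum > Short.MAX_VALUE) sum = Short.MAX_VALUE;
            if (sum < Short.MIN_VALUE) sum = Short.MIN_VALUE;
            // Emit little-endian: low byte first
            out[o++] = (byte) sum;
            out[o++] = (byte) (sum >> 8);
        }
        return out;
    }

    public static void main(String[] args) {
        // One frame: left = 0x0102 (258), right = 0x0304 (772), sum = 1030 = 0x0406
        byte[] mono = convert(new byte[] {0x01, 0x02, 0x03, 0x04});
        System.out.println(mono[0] + " " + mono[1]); // low byte 6, high byte 4
    }
}
```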