Speech recognition from a WAV file or from a processed raw audio buffer

I am working on an Android project where I need to convert speech to text, either from raw audio buffer data or from a stored WAV file. Is this possible on Android? More specifically, I get audio buffers from here:

record.read(audioBuffer, 0, audioBuffer.length);

I process the audio buffer and store it as a WAV file. I need to convert either the processed audio buffer to text or, once the buffer has been saved as a WAV file, convert that file to text using Google's offline speech-to-text option. How can I do this? I have seen other threads here, but they are very old (4, 6, or 7 years old).

There are 2 answers

Homer Wang On

Since Android 13, SpeechRecognizer can accept a file or real-time PCM data as input. I wrote a test project and got it to work.

One caveat at the moment: SpeechRecognizer does not seem to work with every sample rate. For example, I recorded a PCM clip at 22050 Hz, but if I set EXTRA_AUDIO_SOURCE_SAMPLING_RATE to 22050, SpeechRecognizer fails. If I change it to 16000 or 24000, the same audio clip is recognized.
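
If you would rather feed the recognizer audio whose real sample rate matches the declared one, one option is to resample the 22050 Hz clip to 16000 Hz first. The following is only a minimal sketch in plain Java. `PcmResampler` and `resample` are hypothetical names, not part of the Android API; the interpolation is linear with no low-pass filter, so some aliasing is possible when downsampling; and it is untested against SpeechRecognizer, so whether it helps recognition is an assumption.

```java
/** Hypothetical helper: resamples 16-bit mono PCM by linear interpolation. */
final class PcmResampler {

    /** Returns a new array holding the input converted from fromRate to toRate. */
    static short[] resample(short[] in, int fromRate, int toRate) {
        if (fromRate <= 0 || toRate <= 0) {
            throw new IllegalArgumentException("sample rates must be positive");
        }
        if (in.length == 0 || fromRate == toRate) {
            return in.clone();
        }
        // Same duration, different number of samples.
        int outLen = (int) ((long) in.length * toRate / fromRate);
        short[] out = new short[outLen];
        double step = (double) fromRate / toRate;
        for (int i = 0; i < outLen; i++) {
            double pos = i * step;                      // position in the input signal
            int i0 = (int) pos;                         // sample to the left
            int i1 = Math.min(i0 + 1, in.length - 1);   // sample to the right (clamped)
            double frac = pos - i0;
            out[i] = (short) Math.round(in[i0] * (1.0 - frac) + in[i1] * frac);
        }
        return out;
    }

    public static void main(String[] args) {
        short[] oneSecond = new short[22050];           // one second of silence at 22050 Hz
        short[] converted = resample(oneSecond, 22050, 16000);
        System.out.println(converted.length);           // prints 16000
    }
}
```

The resampled samples would then be written to the file or pipe, with EXTRA_AUDIO_SOURCE_SAMPLING_RATE set to 16000.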

Here is how my test project works. I omitted the RECORD_AUDIO runtime permission request; just turn on the permission in the phone's settings after the first crash:

Part 0. Record a raw PCM file of English speech (linear 16-bit, little-endian); I am using a 22050 Hz sample rate. Put the file at res/raw/test.pcm
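
Part 0 uses raw PCM, while the question also mentions a stored WAV file. If the speech is in a WAV file, the sample data can be pulled out of it first. This is a minimal sketch, assuming a standard RIFF/WAVE file that contains uncompressed 16-bit PCM. `WavPcmExtractor` and `extractPcm` are hypothetical names, and the class is plain Java, not part of the Android API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: returns the contents of the "data" chunk of a WAV file. */
final class WavPcmExtractor {

    static byte[] extractPcm(byte[] wav) {
        if (wav.length < 12
                || !tag(wav, 0).equals("RIFF")
                || !tag(wav, 8).equals("WAVE")) {
            throw new IllegalArgumentException("not a RIFF/WAVE file");
        }
        // RIFF chunk sizes are stored as little-endian 32-bit integers.
        ByteBuffer buf = ByteBuffer.wrap(wav).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 12;                                   // first chunk after "RIFF<size>WAVE"
        while (pos + 8 <= wav.length) {
            String id = tag(wav, pos);
            long size = buf.getInt(pos + 4) & 0xFFFFFFFFL;
            int start = pos + 8;
            if (id.equals("data")) {
                // Clamp in case the header claims more bytes than the file holds.
                int end = (int) Math.min(start + size, wav.length);
                return Arrays.copyOfRange(wav, start, end);
            }
            // Skip other chunks ("fmt ", "LIST", ...); chunks are padded to even sizes.
            long next = start + size + (size & 1);
            if (next > wav.length) {
                break;
            }
            pos = (int) next;
        }
        throw new IllegalArgumentException("no data chunk found");
    }

    private static String tag(byte[] bytes, int offset) {
        return new String(bytes, offset, 4, StandardCharsets.US_ASCII);
    }
}
```

The sketch does not read the sample rate or channel count from the "fmt " chunk, so those still have to be known and passed to the recognizer separately.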

Part 1. Create an Android Studio project. In the manifest, add the following inside the root tag, at the end:

   <manifest xmlns:...
       ...
       <uses-permission android:name="android.permission.INTERNET" />
       <uses-permission android:name="android.permission.RECORD_AUDIO" />
       <queries>
           <intent>
               <action android:name="android.speech.RecognitionService" />
           </intent>
       </queries> 
   </manifest> 

Part 2. Add all of the following code blocks to the MainActivity class.

i. Variables

   // toggle either function of this sample project
   // 1 for PCM file in res/raw
   // 2 for real time PCM data from AudioRecord
   static final int AUDIO_SOURCE_TYPE = 1; 
   android.speech.SpeechRecognizer speechRecognizer = null;
   ParcelFileDescriptor[] m_audioPipe;
   ParcelFileDescriptor mExtraAudioPFD;
   ParcelFileDescriptor.AutoCloseOutputStream mOutputStream;
   AudioRecord audioRec;
   Thread m_hAutoRecordThread;
   volatile boolean m_bTerminateThread; // set on the main thread, read by the recording thread

ii. Functions for the life cycle of SpeechRecognizer. Note: the sample rate works with 16000 and 24000 but not with 22050, even though the original source was recorded at 22050 Hz.

@RequiresApi(api = Build.VERSION_CODES.TIRAMISU)
private final Intent createSpeechRecognizerIntent() {

    final Intent speechRecognizerIntent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS, 3000);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_MINIMUM_LENGTH_MILLIS, 6000);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS, 2000);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US");

    if (AUDIO_SOURCE_TYPE == 1) {
        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE, mExtraAudioPFD);
    } else if (AUDIO_SOURCE_TYPE == 2) {
        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE, m_audioPipe[0]);
    }
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE_CHANNEL_COUNT, 1);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE_ENCODING, AudioFormat.ENCODING_PCM_16BIT);
    speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE_SAMPLING_RATE, 24000); 


    return speechRecognizerIntent;
}

protected void initRecognizer() {

    speechRecognizer = android.speech.SpeechRecognizer.createSpeechRecognizer(this);
    speechRecognizer.setRecognitionListener(new RecognitionListener() {
        @Override public void onReadyForSpeech(Bundle bundle) { Log.i("recognizer", "onReadyForSpeech"); }
        @Override public void onBeginningOfSpeech() { Log.i("recognizer", "onBeginningOfSpeech"); }
        @Override public void onRmsChanged(float v) {
            Log.i("onRmsChanged", "v = " + v);
        }
        @Override public void onBufferReceived(byte[] bytes) { ; }
        @Override public void onEndOfSpeech() {
            Log.i("recognizer", "onEndOfSpeech");
            stopRecognizer();
        }
        @Override public void onError(int i) { Log.i("recognizer", "onError = " + i); }
        @Override public void onResults(Bundle bundle) {

            Log.i("recognizer", "onResults");
            final ArrayList<String> data = bundle.getStringArrayList(android.speech.SpeechRecognizer.RESULTS_RECOGNITION);

            if (data != null && data.size() > 0) {
                // The first entry is the recognizer's top hypothesis
                Log.i("SpeechRecogn", "final result = " + data.get(0));
            }
        }
        @Override public void onPartialResults(Bundle bundle) {

            Log.i("recognizer", "onPartialResults");
            final ArrayList<String> data = bundle.getStringArrayList(android.speech.SpeechRecognizer.RESULTS_RECOGNITION);

            if (data != null && data.size() > 0) {
                // The first entry is the recognizer's top hypothesis so far
                Log.i("SpeechRecogn", "partial result = " + data.get(0));
            }
        }
        @Override public void onEvent(int i, Bundle bundle) { Log.i("recognizer", "onEvent"); }
    });
}

void stopRecognizer() {

    m_bTerminateThread = true;
    new Handler(Looper.getMainLooper()).post(new Runnable() {
        @Override
        public void run() {

            if (speechRecognizer != null) {
                speechRecognizer.stopListening();
                try {
                    if (mOutputStream != null) {
                        mOutputStream.close();
                        mOutputStream = null;
                    }
                } catch (IOException e) {
                    Log.w("recognizer", "failed to close the audio pipe", e);
                }
                speechRecognizer = null;
            }
        }
    });
}

iii. AudioRecord thread, used only when you choose real-time PCM data

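private class RecordingRunnable implements Runnable {

    @Override
    public void run() {
        // Reuse one buffer instead of allocating a new one on every iteration
        short[] readBuf = new short[1024];
        while (!m_bTerminateThread) {

            int readLength = audioRec.read(readBuf, 0, readBuf.length);
            if (readLength < 0) {
                // Negative values are AudioRecord error codes
                Log.e("RecordingThread", "AudioRecord.read failed: " + readLength);
                break;
            }

            byte[] readBytes = ShortArrayToByteArray(readBuf);
            try {
                if (mOutputStream != null) {
                    // Each 16-bit sample is two bytes; forward only what was actually read
                    mOutputStream.write(readBytes, 0, readLength * 2);
                    mOutputStream.flush();
                }
            } catch (IOException e) {
                Log.w("RecordingThread", "write to the audio pipe failed", e);
            }
        }
    }
}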

iv. Utility functions

protected byte[] ShortArrayToByteArray(short[] sa) {
    byte[] ret = new byte[sa.length * 2];

    ByteBuffer.wrap(ret).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().put(sa);
    return ret;
}
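
To make the byte layout concrete, here is a standalone plain Java sketch of the same conversion. The class name `PcmBytes`, the method name `toLittleEndian`, and the extra `count` parameter are hypothetical additions for illustration; `count` lets the caller convert only the samples that were actually read.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: packs 16-bit samples as little-endian PCM bytes. */
final class PcmBytes {

    /** Converts the first count samples; every sample becomes two bytes, low byte first. */
    static byte[] toLittleEndian(short[] samples, int count) {
        if (count < 0 || count > samples.length) {
            throw new IllegalArgumentException("count out of range: " + count);
        }
        byte[] out = new byte[count * 2];
        ByteBuffer.wrap(out)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asShortBuffer()
                .put(samples, 0, count);
        return out;
    }

    public static void main(String[] args) {
        byte[] b = toLittleEndian(new short[] {0x1234, -2}, 2);
        // 0x1234 is stored as 34 12, and -2 (0xFFFE) as FE FF
        System.out.printf("%02X %02X %02X %02X%n",
                b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF, b[3] & 0xFF);   // prints 34 12 FE FF
    }
}
```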

// Function adapted from
// https://stackoverflow.com/questions/8664468/copying-raw-file-into-sdcard/46244121#46244121
private String copyFiletoStorage(int resourceId, String resourceName){
    String filePath = getFilesDir().getPath() + "/" + resourceName;
    // try-with-resources closes both streams even if opening the output file or a write fails
    try (InputStream in = getResources().openRawResource(resourceId);
         FileOutputStream out = new FileOutputStream(filePath)) {
        byte[] buff = new byte[1024];
        int read;
        while ((read = in.read(buff)) > 0) {
            out.write(buff, 0, read);
        }
    } catch (IOException e) {
        // Also covers FileNotFoundException, which is a subclass of IOException
        e.printStackTrace();
    }
    return filePath;
}

v. Main function in onStart()

@Override
protected void onStart() {

    super.onStart();

    if (AUDIO_SOURCE_TYPE == 1) {
        try {
            String testFilePath = copyFiletoStorage(R.raw.test, "test.pcm");
            mExtraAudioPFD = ParcelFileDescriptor.open(new File(testFilePath), ParcelFileDescriptor.MODE_READ_ONLY);
        } catch (FileNotFoundException e) {
            mExtraAudioPFD = null;
        }
    } else if (AUDIO_SOURCE_TYPE == 2) {

        try {
            m_audioPipe = ParcelFileDescriptor.createPipe();
        } catch (IOException e) {
            // Without a pipe there is nothing to feed the recognizer, so stop here
            finishAndRemoveTask();
            return;
        }

        mOutputStream = new ParcelFileDescriptor.AutoCloseOutputStream(m_audioPipe[1]);
    }

    initRecognizer();

    if (AUDIO_SOURCE_TYPE == 2) {
        try {
            // omitted permission check and request
            // RECORD_AUDIO must be granted manually in the system settings to run this code
            audioRec = new AudioRecord(MediaRecorder.AudioSource.DEFAULT, 22050, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, 524288);
        } catch (IllegalArgumentException e) {
            Log.e("audioRec", "IllegalArgument");
        } catch (SecurityException e) {
            Log.e("audioRec", "SecurityException!");
        } catch (Exception e) {
            Log.e("audioRec", "any Exception");
        }

        if (audioRec == null || audioRec.getState() != AudioRecord.STATE_INITIALIZED) {
            // Construction failed (see the log); startRecording() would throw
            return;
        }

        m_bTerminateThread = false;

        audioRec.startRecording();
        m_hAutoRecordThread = new Thread(new RecordingRunnable(), "RecordingThread");
        m_hAutoRecordThread.start();
    }

    final Intent speechRecognizerIntent = createSpeechRecognizerIntent();
    speechRecognizer.startListening(speechRecognizerIntent);

    if (AUDIO_SOURCE_TYPE == 2) {
        new Timer().schedule(
                new TimerTask() {

                    @Override
                    public void run() {

                        stopRecognizer();
                    }
                }, 5000);
    }
}

threewire On

I came across Google's Cloud Speech API, which can take a raw audio file as input and perform asynchronous speech recognition. I have limited experience with app development and with Java. This link shows how to do it: https://cloud.google.com/speech/docs/async-recognize and here is some longer sample source code: https://github.com/GoogleCloudPlatform/java-docs-samples/blob/master/speech/cloud-client/src/main/java/com/example/speech/QuickstartSample.java The problem is that when I added the following import statements to my application code (MainActivity.java in Android Studio), they get greyed out and some are marked in red.

import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.RecognizeResponse;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1.SpeechRecognitionResult;
import com.google.protobuf.ByteString;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
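
In the linked sample, the java.nio.file imports are only used to load the audio file into memory. On Android, java.nio.file is available from API level 26, so with a lower minSdk those lines may be among the ones marked in red (this is a guess; which imports fail in my project may differ). A minimal sketch of loading the file with java.io instead, where `AudioFileBytes` and `readAllBytes` are hypothetical names:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: reads a whole audio file into memory using only java.io. */
final class AudioFileBytes {

    static byte[] readAllBytes(File file) throws IOException {
        // try-with-resources closes both streams on success and on failure
        try (InputStream in = new FileInputStream(file);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

The resulting byte array can be wrapped with ByteString.copyFrom(bytes), as the sample does with the bytes it reads.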