I am doing a personal project for educational purposes to learn Keras and machine learning. To start, I would like to classify whether a sound is a clap or a stomp.

I am using a sound-triggered microcontroller that samples audio every 20 µs (a 50 kHz sample rate) and sends the raw ADC data to a PC for processing in Python. I currently capture 1000 points and compute the FFT with NumPy (using rfft and taking its absolute value).
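For reference, the capture-to-spectrum step described above might look like the sketch below. The simulated 12-bit ADC array is a stand-in for the real serial data from the microcontroller; everything else follows the numbers in the question (1000 points, 20 µs period).

```python
import numpy as np

SAMPLE_PERIOD_S = 20e-6   # 20 µs between samples -> 50 kHz sample rate
N_SAMPLES = 1000          # points per capture

# Stand-in for the raw ADC capture (replace with data from the serial port).
rng = np.random.default_rng(0)
adc = rng.integers(0, 4096, size=N_SAMPLES).astype(np.float64)

# Remove the DC offset so the 0 Hz bin does not dominate, then take
# the magnitude of the one-sided FFT.
adc -= adc.mean()
spectrum = np.abs(np.fft.rfft(adc))                     # 501 bins for 1000 samples
freqs = np.fft.rfftfreq(N_SAMPLES, d=SAMPLE_PERIOD_S)   # 0 Hz up to 25 kHz
```

Each capture therefore becomes a fixed-length 501-bin magnitude vector, which is the input format the answers below assume.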

Now I would like to feed the captured FFT signals for claps and stomps as training data to a neural network classifier. I have been researching this all day; some articles say a Convolutional Neural Network should be used, while others recommend a Recurrent Neural Network.

Looking into Convolutional Neural Networks raised another question: should I be using Keras' 1D or 2D Conv layers?

2 Answers

Shubham Panchal

You need to process the FFT signals to classify whether the sound is a clap or a stomp.

For Convolutional Neural Networks (CNNs):

CNNs extract features from fixed-length inputs. 1D CNNs with max-pooling work well on signal data (I have personally used them on accelerometer data).

Use them if your input is fixed length and contains significant local features.
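Since the question asks specifically about 1D vs. 2D Conv in Keras: a single FFT magnitude vector is one-dimensional, so Conv1D is the natural fit (Conv2D would apply to image-like inputs such as spectrograms). A minimal sketch, assuming each example is the 501-bin magnitude vector from the question and the labels are binary (clap vs. stomp); the layer sizes are illustrative, not tuned:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_BINS = 501  # len(np.abs(np.fft.rfft(x))) for a 1000-sample capture

model = keras.Sequential([
    layers.Input(shape=(N_BINS, 1)),            # channels-last: (bins, 1)
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # P(stomp), say
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

Train with `model.fit(X, y, ...)` where `X` has shape `(n_examples, 501, 1)` and `y` is 0/1.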

For Recurrent Neural Networks:

These should be used when the signal has temporal structure.

Temporal features can be understood, for example, in the context of recognizing a clap: a clap has an immediate loud onset followed by a soft decay as it ends. An RNN can learn these two features in sequence. Clapping is itself a sequential action, consisting of several sub-activities in order.

RNNs and LSTMs can be the best choice if they are fed good features.
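One hedged way to apply an RNN here is to split each 1000-sample capture into short frames and let an LSTM learn the temporal envelope (sharp attack, then decay). The 20 × 50 framing below is an illustrative assumption, not something prescribed by the question:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Reshape each 1000-sample capture into 20 frames of 50 samples each,
# so the LSTM sees a sequence of 20 timesteps.
N_FRAMES, FRAME_LEN = 20, 50

model = keras.Sequential([
    layers.Input(shape=(N_FRAMES, FRAME_LEN)),
    layers.LSTM(32),                         # summarizes the sequence
    layers.Dense(1, activation="sigmoid"),   # clap vs. stomp
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The input would be `x.reshape(n_examples, 20, 50)` applied to the raw captures (or to per-frame features instead of raw samples).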

A hybrid Conv-LSTM:

This network is a hybrid of CNNs and LSTMs (RNNs): CNNs extract features, and the resulting sequence is then learned by LSTMs. The features extracted by the CNNs also retain temporal structure.

This is straightforward to build if you are using Keras.
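One common way to realize the hybrid in Keras is simply to stack Conv1D layers (feature extraction) in front of an LSTM (sequence learning). This is a sketch, not the only formulation; Keras also ships a dedicated `ConvLSTM2D` layer, but for 1D signals the stacked form below is simpler. Filter counts and strides are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(1000, 1)),                       # raw 1000-sample capture
    layers.Conv1D(16, kernel_size=9, strides=4,
                  activation="relu"),                    # local feature extraction
    layers.MaxPooling1D(pool_size=2),                    # downsample the sequence
    layers.LSTM(32),                                     # learn the feature sequence
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```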


Since this is audio classification, I also suggest using MFCCs to extract features.

I think you should try all three approaches and see which suits your data best. Most probably RNNs and ConvLSTMs will work for your use case.

Hope it helps.

prabindh

Since the train/test system is not an embedded system in this case, take a look at VGGish (https://github.com/tensorflow/models/tree/master/research/audioset — it also links to the paper and a dataset that includes clapping), which computes its features as follows:

VGGish was trained with audio features computed as follows:

  • All audio is resampled to 16 kHz mono.
  • A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
  • A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz.
  • A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.01) where the offset is used to avoid taking a logarithm of zero.
  • These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.

Note: clapping is already covered in AudioSet (https://research.google.com/audioset/dataset/clapping.html).