I am getting started with Google's AudioSet. The dataset is extensive, but I find the information about the audio feature extraction very vague. The website mentions:
128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et. al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.
In the paper, the authors discuss computing mel spectrograms on 960 ms chunks to get a 96x64 representation. It is unclear to me how they get from there to the 1x128 representation used in AudioSet. Does anyone know more about this?
They use the 96*64 data as input for a modified VGG network. The last layer of VGG is FC-128, so its output will be 1*128, and that is the reason.
The architecture of VGG can be found here: https://github.com/tensorflow/models/blob/master/research/audioset/vggish_slim.py
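To make the shape change concrete, here is a rough sketch that only tracks tensor shapes through a VGG-style stack. It is not the code from the linked file: the function name `trace_vggish_shapes` is made up for illustration, and the layer layout (four conv stages with 64/128/256/512 channels, each followed by a 2x2 max-pool, then FC-4096, FC-4096, FC-128) is my reading of the architecture, so please check it against `vggish_slim.py`.

```python
def trace_vggish_shapes(frames=96, mel_bands=64):
    """Return (stage, shape) pairs for a VGG-style stack on one log-mel patch.

    Assumed layout (verify against vggish_slim.py): 3x3 'same' convolutions
    keep height and width, and each 2x2 max-pool halves both.
    """
    h, w, c = frames, mel_bands, 1
    shapes = [("input", (h, w, c))]

    # Each conv stage sets the channel count; the pool after it halves h and w.
    for name, channels in [("conv1", 64), ("conv2", 128),
                           ("conv3", 256), ("conv4", 512)]:
        c = channels
        h, w = h // 2, w // 2
        shapes.append((name + "+pool", (h, w, c)))

    # Flatten the final feature map into one vector per patch.
    shapes.append(("flatten", (h * w * c,)))

    # Fully connected layers; the last one produces the embedding.
    for name, units in [("fc1", 4096), ("fc2", 4096), ("fc3_embedding", 128)]:
        shapes.append((name, (units,)))

    return shapes


for stage, shape in trace_vggish_shapes():
    print(f"{stage:>14}: {shape}")
```

Under these assumptions, every 960 ms patch (96 frames x 64 mel bands) ends up as a single 128-value vector, which lines up with the roughly 1 Hz feature rate quoted on the website. The PCA and quantization mentioned there are a separate post-processing step and are not shown here.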