I am getting started with Google's Audioset. While the dataset is extensive, I find the information with regards to the audio feature extraction very vague. The website mentions
128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et. al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.
Within the paper, the authors discuss using mel spectrograms on 960 ms chunks to get a 96x64 representation. It is then unclear to me how they get to the 1x128 format representation used in the Audioset. Does anyone know more about this??
They use the
96*64
data as input for a modifiedVGG
network.The last layer ofVGG
isFC-128
, so its output will be1*128
, and that is the reason.The architecture of
VGG
can be found here: https://github.com/tensorflow/models/blob/master/research/audioset/vggish_slim.py