Reading AudioSet data: not all records seem to be present


I am reading data from the AudioSet dataset and I find fewer records than there are supposed to be. The data are saved as TFRecord files. To read the AudioSet TFRecord files and select thunder sounds (label = 287), I have written the following function.


import os
import tensorflow as tf
print(tf.__version__)

def filter_for_thunder(tfrecord_folder, new_file_path, value_to_check):
    """Parse through the TFRecord files in a folder and search for entries with one specific label,
    for example thunder (287). When one is found, append the whole entry to a list.

    Args:
        tfrecord_folder (str): path of the folder containing the original TFRecord files
        new_file_path (str): path of the new file meant to hold only the selected entries (not used yet)
        value_to_check (int): value of the label to use as filter

    Returns:
        list: the serialized records that contain the label
    """
    
    #create a list with all tfrecord files 
    tfrecord_files = [os.path.join(tfrecord_folder, f) for f in os.listdir(tfrecord_folder) if f.endswith('.tfrecord')]
    entries_list = [] #empty list to append matching records
    label_counter = 0 #counter to count the number of matching labels
    all_counter = 0
    
    for file in tfrecord_files:
          
        for raw_record in tf.data.TFRecordDataset(file):
            example = tf.train.Example()
            example.ParseFromString(raw_record.numpy())
            all_counter += 1

            if value_to_check in example.features.feature['labels'].int64_list.value:
                # If the value is found, keep the serialized example in the list
                entries_list.append(raw_record.numpy())                
                label_counter += 1
                #print(label_counter, raw_record.numpy())
                print(example.features.feature['labels'].int64_list.value, 'picked')
            else:
                print(example.features.feature['labels'].int64_list.value, 'not picked')
                
    
    print('I found', label_counter, 'records for label', value_to_check, 'out of', all_counter, 'total records')
    return entries_list  

tfrecord_folder = '' #put the path to your tfrecord files here
new_filename = 'thunder_bal_training.tfrecord'

entries_list = filter_for_thunder(tfrecord_folder, new_filename, value_to_check=287)

Inside AudioSet there are three subfolders (balanced training, unbalanced training, and evaluation). For each of them, the function finds fewer records than there are supposed to be (this is shown in the final print). For example, for the unbalanced training folder the function finds 419 thunder sounds instead of 1163. The problem is not in the selection of the particular sound (the if in the function): the function does not read all the records, only 777,649 of the more than 2,042,985 there should be according to the AudioSet page. Any idea why this happens?

I tried reading different labels, and the problem seems to be in the reading of the records, before the filtering for a particular label. I would like my function to read all the records.
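
One way to narrow this down might be to count the records in each file without going through TensorFlow at all, and compare the per-file counts with what the loop above reports. The following is only a sketch. It assumes the files are uncompressed and follow the standard TFRecord framing (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). The helper names count_tfrecord_records and count_folder are made up for illustration, and the CRC fields are skipped rather than verified.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count the records in one uncompressed TFRecord file.

    Walks the record framing only; payloads are not parsed and CRCs are
    not checked. Returns (n_records, truncated), where truncated is True
    if the file ends in the middle of a record.
    """
    size = os.path.getsize(path)
    n_records = 0
    pos = 0
    with open(path, "rb") as f:
        while pos < size:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if len(header) < 12:
                return n_records, True
            (length,) = struct.unpack("<Q", header[:8])
            pos += 12 + length + 4  # header + payload + uint32 CRC of the payload
            if pos > size:
                return n_records, True
            f.seek(pos)  # jump to the start of the next record
            n_records += 1
    return n_records, False


def count_folder(tfrecord_folder):
    """Return {file name: (n_records, truncated)} for every .tfrecord file."""
    counts = {}
    for name in sorted(os.listdir(tfrecord_folder)):
        if name.endswith(".tfrecord"):
            path = os.path.join(tfrecord_folder, name)
            counts[name] = count_tfrecord_records(path)
    return counts
```

Calling count_folder(tfrecord_folder) and summing the counts gives a total to compare against all_counter. If the two totals agree, the records are probably missing from the files on disk (for example an incomplete download or extraction, or fewer files than expected) rather than being skipped by the reading loop. Any file flagged as truncated would be a candidate for downloading again.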

There are 0 answers