Example pipeline for TFRecords with chunking for long input sequences


I'm trying to optimise the input pipeline for a GRU-based model. The data consists of a large number of files, each containing a time series of length 5000 with dimensionality 50. I know it isn't feasible to feed a single sequence of length 5000 into an RNN because of vanishing gradients, so you should instead chunk it into roughly (5000 - seq_len) overlapping chunks, where seq_len is a more manageable length, say 200 timesteps.

The most obvious way to get this working with TFRecords/SequenceExamples is to write each chunk as a separate SequenceExample in the same file. This seems massively inefficient, however: because the windows overlap, the resulting TFRecords file is almost entirely duplicated data.

Is there a better method of doing this? I've seen very few examples of how to use TFRecords that don't involve images, and no examples that use non-trivial sequence lengths!

For example:

import numpy as np
import tensorflow as tf

def chunk_save_tfrecords(X, file_path_prefix, seq_length):

    # Generate tfrecord writer (tf.io.TFRecordWriter in TF 2.x)
    result_tf_file = file_path_prefix + '.tfrecords'
    with tf.python_io.TFRecordWriter(result_tf_file) as writer:

        # Slide a window of seq_length over the series with stride 1
        for i in range(X.shape[0] - seq_length):
            chunk = X[i:i + seq_length]

            # One FloatList of 50 values per timestep
            data_features = [
                tf.train.Feature(
                    float_list=tf.train.FloatList(value=chunk[t]))
                for t in range(seq_length)]
            feature_lists = tf.train.FeatureLists(
                feature_list={
                    'data': tf.train.FeatureList(feature=data_features)})

            serialized = tf.train.SequenceExample(
                feature_lists=feature_lists).SerializeToString()
            writer.write(serialized)

def save_tfrecords(X, file_path_prefix):

    # Generate tfrecord writer
    result_tf_file = file_path_prefix + '.tfrecords'
    with tf.python_io.TFRecordWriter(result_tf_file) as writer:

        # One FloatList of 50 values per timestep, whole series as one example
        data_features = [
            tf.train.Feature(
                float_list=tf.train.FloatList(value=X[t]))
            for t in range(X.shape[0])]
        feature_lists = tf.train.FeatureLists(
            feature_list={
                'data': tf.train.FeatureList(feature=data_features)})

        serialized = tf.train.SequenceExample(
            feature_lists=feature_lists).SerializeToString()
        writer.write(serialized)

test = np.random.randn(5000,50)
save_tfrecords(test, 'test')
chunk_save_tfrecords(test, 'test_chunk', 200)
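
For completeness, I assume the read side would look roughly like this (an untested sketch; make_dataset and parse_fn are just placeholder names, and tf.parse_single_sequence_example is the TF 1.x spelling):

def make_dataset(tf_file, batch_size=32):
    # Parse one SequenceExample back into a (seq_len, 50) float tensor
    def parse_fn(serialized):
        _, sequence = tf.parse_single_sequence_example(
            serialized,
            sequence_features={
                'data': tf.FixedLenSequenceFeature([50], dtype=tf.float32)})
        return sequence['data']

    return (tf.data.TFRecordDataset(tf_file)
            .map(parse_fn)
            .shuffle(1000)
            .batch(batch_size))

dataset = make_dataset('test_chunk.tfrecords')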

save_tfrecords creates a ~1MB file, while chunk_save_tfrecords creates a ~200MB file! That blow-up is expected: with stride-1 windows, each timestep gets duplicated into up to seq_length = 200 chunks.
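
The only alternative I can think of is to store each full sequence once and do the chunking at read time with tf.data, something like the untested sketch below. It assumes a TF version where tf.signal.frame exists (>= 1.14; earlier it lived in tf.contrib.signal), and parse_full/to_chunks are again placeholder names:

seq_len = 200

def parse_full(serialized):
    # Recover the full (5000, 50) series from a single SequenceExample
    _, sequence = tf.parse_single_sequence_example(
        serialized,
        sequence_features={
            'data': tf.FixedLenSequenceFeature([50], dtype=tf.float32)})
    return sequence['data']

def to_chunks(series):
    # All stride-1 windows in one op: shape (n_windows, seq_len, 50)
    frames = tf.signal.frame(series, frame_length=seq_len, frame_step=1, axis=0)
    return tf.data.Dataset.from_tensor_slices(frames)

dataset = (tf.data.TFRecordDataset('test.tfrecords')
           .map(parse_full)
           .flat_map(to_chunks)
           .shuffle(10000)
           .batch(32))

That would keep the file at ~1MB while still producing the same stride-1 windows, but I have no idea whether this is actually efficient in practice, or whether there's a more standard pattern.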
