How to avoid mixing data from different CSV files in one sample using tf.data.experimental.CsvDataset

I am having trouble combining sliding windows with tf.data.experimental.CsvDataset to build a clean LSTM dataset.

My goal is to create a time-series dataset for an LSTM based on many CSV files.

Each CSV file holds the data of one specific site. For this reason, I need to ensure that no sample mixes data from more than one site (CSV file).

My code works end to end, except that it mixes data inside some samples:

import tensorflow as tf

def load_my_csv(path):
    # read columns 1-6 of one site's csv as int32, skipping the header row
    return tf.data.experimental.CsvDataset(
        path,
        record_defaults=[tf.int32]*6,
        header=True,
        field_delim=',',
        select_cols=[1,2,3,4,5,6]
    )

# list all csv files and load them one after another
list_csv = tf.data.Dataset.list_files(test_samples + '*.csv', shuffle=False)
ds = list_csv.interleave(load_my_csv, cycle_length=1)

# turn the rows into time-series sequences
ds = ds.map(lambda *items: tf.stack(items))
ds = ds.window(input_sequence_length + output_sequence_length, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda x: x).batch(input_sequence_length + output_sequence_length)

print_dataset(ds)

Based on this dummy data,

x1,x2,x3,x4,y,yy,yyy
1,11,111,1111,-1,-11,-111
2,22,222,2222,-2,-22,-222
3,33,333,3333,-3,-33,-333
4,44,444,4444,-4,-44,-444
5,55,555,5555,-5,-55,-555
6,66,666,6666,-6,-66,-666
7,77,777,7777,-7,-77,-777
8,88,888,8888,-8,-88,-888
9,99,999,9999,-9,-99,-999

my code processes the data correctly, as in this sample:

array([[[  22.,  222., 2222.],
        [  33.,  333., 3333.],
        [  44.,  444., 4444.],
        [  55.,  555., 5555.],
        [  66.,  666., 6666.]]]

but in some samples I get a mix of two CSV files (the end of one and the start of the next), like this:

array([[[  88.,  888., 8888.],
        [  99.,  999., 9999.],
        [  11.,  111., 1111.],
        [  22.,  222., 2222.],
        [  33.,  333., 3333.]]]
       

My understanding is that the dataset reads all my .csv files as one continuous stream and builds the windows on it, without knowing where each CSV starts and ends.
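
That reading can be illustrated without TensorFlow. The following is a minimal plain-Python sketch, not the tf.data implementation: the row labels and the `sliding_windows` and `sites_in` helpers are made up for illustration. It shows that windows built over a concatenated stream straddle the file boundary, while windows built per file do not.

```python
def sliding_windows(rows, size):
    """Return every run of `size` consecutive rows (shift of 1, remainder dropped)."""
    return [rows[i:i + size] for i in range(len(rows) - size + 1)]


def sites_in(window):
    """Set of site letters in a window (first character of each made-up row label)."""
    return {row[0] for row in window}


# Made-up rows standing in for two sites' CSV files.
site_a = ["a1", "a2", "a3", "a4"]
site_b = ["b1", "b2", "b3", "b4"]
size = 3

# Windowing the concatenated stream: some windows cross the a/b boundary.
mixed = sliding_windows(site_a + site_b, size)
mixed_crossing = [w for w in mixed if len(sites_in(w)) > 1]
print(mixed_crossing)  # [['a3', 'a4', 'b1'], ['a4', 'b1', 'b2']]

# Windowing each site separately, then concatenating: no window crosses.
per_site = sliding_windows(site_a, size) + sliding_windows(site_b, size)
per_site_crossing = [w for w in per_site if len(sites_in(w)) > 1]
print(per_site_crossing)  # []
```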

How can I ensure that no sample mixes two CSV files? I'm new to tf.data.Dataset and spent a long time building this working code, but I'm really stuck on this point.
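
One possible direction, sketched here under assumptions rather than given as a confirmed fix, is to build the windows inside the per-file function, so that each CSV is windowed on its own before the files are chained together. The `make_site_windows` and `load_windows` names are hypothetical. The column layout (7 columns, of which 1 to 6 are read as int32) is taken from the code and dummy data above.

```python
import tensorflow as tf


def make_site_windows(file_pattern, window_size):
    """Sliding windows over each CSV separately (hypothetical helper).

    Every element has shape (window_size, 6) and comes from a single file.
    """

    def load_windows(path):
        # Rows of one file only, read with the same settings as in the question.
        rows = tf.data.experimental.CsvDataset(
            path,
            record_defaults=[tf.int32] * 6,
            header=True,
            field_delim=',',
            select_cols=[1, 2, 3, 4, 5, 6],
        )
        rows = rows.map(lambda *items: tf.stack(items))
        # Window within this file; incomplete windows at the end are dropped.
        windows = rows.window(window_size, shift=1, drop_remainder=True)
        # Each window is a small dataset; batch it into one tensor.
        return windows.flat_map(
            lambda w: w.batch(window_size, drop_remainder=True))

    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Chain the per-file window datasets one after another.
    return files.flat_map(load_windows)
```

With the names from the question, this would be called as make_site_windows(test_samples + '*.csv', input_sequence_length + output_sequence_length). A file with fewer rows than the window size contributes no samples, because drop_remainder=True discards incomplete windows.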
