How to create mini-batches using tensorflow.data.experimental.CsvDataset compatible with model's input shape?


I want to train with mini-batches using tensorflow.data.experimental.CsvDataset in TensorFlow 2, but the shape of the tensors it produces does not match my model's input shape.

What is the best way to do mini-batch training with a TensorFlow dataset?

I tried the following:

# I have a dataset with 4 features and 1 label
feature = tf.data.experimental.CsvDataset(['C:/data/iris_0.csv'], record_defaults=[.0] * 4, header=True, select_cols=[0,1,2,3])
label = tf.data.experimental.CsvDataset(['C:/data/iris_0.csv'], record_defaults=[.0] * 1, header=True, select_cols=[4])
dataset = tf.data.Dataset.zip((feature, label))

# Then I try mini-batch training:
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(loss='mse', optimizer='sgd')
model.fit(dataset.repeat(1).batch(3), epochs=1)

I got an error:

ValueError: Error when checking input: expected dense_6_input to have shape (4,) but got array with shape (1,)

I think this is because CsvDataset() returns tensors laid out as (features, batch), but I need a tensor of shape (batch, features).

Reference code:

for feature, label in dataset.repeat(1).batch(3).take(1):
    print(feature)

# (<tf.Tensor: id=487, shape=(3,), dtype=float32, numpy=array([5.1, 4.9, 4.7], dtype=float32)>, <tf.Tensor: id=488, shape=(3,), dtype=float32, numpy=array([3.5, 3. , 3.2], dtype=float32)>, <tf.Tensor: id=489, shape=(3,), dtype=float32, numpy=array([1.4, 1.4, 1.3], dtype=float32)>, <tf.Tensor: id=490, shape=(3,), dtype=float32, numpy=array([0.2, 0.2, 0.2], dtype=float32)>)

There is 1 answer

today On BEST ANSWER

tf.data.experimental.CsvDataset creates a dataset in which each element corresponds to one row of the CSV file and consists of multiple tensors, i.e. a separate tensor for each column. Therefore, you need to use the dataset's map method to stack these tensors into a single tensor so that it is compatible with the input shape expected by the model:

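def map_func(features, label):
    # After batching, each column is a tensor of shape (batch,);
    # stacking along axis=1 gives shape (batch, num_columns).
    return tf.stack(features, axis=1), tf.stack(label, axis=1)

# Batch first: axis=1 is only valid once the tensors have a batch dimension.
dataset = dataset.batch(BATCH_SIZE).map(map_func)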
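
To check the shapes end to end, here is a minimal self-contained sketch. It is not the asker's setup: the small CSV written to a temporary file, the column names, and the helper name split_row are assumptions made for illustration. It reads all five columns in a single CsvDataset and stacks the per-row scalars along axis=0 before batching.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical stand-in for the iris CSV: 4 feature columns and 1 label column.
csv_text = (
    "f0,f1,f2,f3,label\n"
    "5.1,3.5,1.4,0.2,0\n"
    "4.9,3.0,1.4,0.2,0\n"
    "4.7,3.2,1.3,0.2,0\n"
    "7.0,3.2,4.7,1.4,1\n"
    "6.4,3.2,4.5,1.5,1\n"
    "6.9,3.1,4.9,1.5,1\n"
)
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w") as f:
    f.write(csv_text)

# Each element of this dataset is a tuple of 5 scalar tensors (one per column).
rows = tf.data.experimental.CsvDataset(
    [path], record_defaults=[0.0] * 5, header=True
)


def split_row(*cols):
    # No batch dimension exists yet, so the scalars are stacked along axis=0.
    features = tf.stack(cols[:4], axis=0)  # shape (4,)
    label = tf.stack(cols[4:], axis=0)     # shape (1,)
    return features, label


BATCH_SIZE = 3
dataset = rows.map(split_row).batch(BATCH_SIZE)

# Inspect one batch: features are (batch, 4) and labels are (batch, 1).
features_batch, label_batch = next(iter(dataset))
print(features_batch.shape, label_batch.shape)

# The batched dataset can now be passed to a model expecting input shape (4,).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(loss="mse", optimizer="sgd")
history = model.fit(dataset, epochs=1, verbose=0)
```

Stacking along axis=0 before batching is one option. Batching first and then stacking along axis=1 produces the same (batch, features) layout and applies the map function once per batch instead of once per row.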