Tensorflow: convert PrefetchDataset to BatchDataset

With the latest TensorFlow version 2.3.1, I am trying to follow the basic text classification example at https://www.tensorflow.org/tutorials/keras/text_classification. Instead of creating the dataset from a directory as the example does, I am loading it from a CSV file:

import tensorflow as tf

SELECT_COLUMNS = ['SentimentText','Sentiment']
LABEL_COLUMN = 'Sentiment'
LABELS = [0, 1]

def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=3, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True, 
      **kwargs)
    return dataset

all_data = get_dataset(data_path, select_columns=SELECT_COLUMNS)

As a result, I get:

type(all_data)
tensorflow.python.data.ops.dataset_ops.PrefetchDataset
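
Inspecting the element structure shows that each element is a (features, labels) pair, where features is an OrderedDict keyed by column name rather than a single string tensor (a quick check; output summarized, not verbatim):

# Each element of the CSV dataset is (features_dict, labels);
# features_dict maps 'SentimentText' to a batch of strings.
print(all_data.element_spec)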

The example loads data from a directory with:

batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='training', 
    seed=seed)

and gets a dataset of a different type:

type(raw_train_ds)
tensorflow.python.data.ops.dataset_ops.BatchDataset
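
For comparison, the directory-based dataset yields plain (text, label) tensor pairs, which is the structure TextVectorization.adapt expects (again summarized, not verbatim output):

# Each element here is (batch_of_strings, batch_of_labels),
# with no dict wrapping the text column.
print(raw_train_ds.element_spec)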

Now, when I try to standardise and vectorise the data with the functions from the example:

import re
import string

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

max_features = 10000
sequence_length = 250

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

and apply them to my dataset, I get an error:

# Make a text-only dataset (without labels), then call adapt
train_text = all_data.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-1f1fc445912d> in <module>
      1 # Make a text-only dataset (without labels), then call adapt
      2 train_text = all_data.map(lambda x, y: x)
----> 3 vectorize_layer.adapt(train_text)

/opt/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/preprocessing/text_vectorization.py in adapt(self, data, reset_state)
    378       shape = dataset_ops.get_legacy_output_shapes(data)
    379       if not isinstance(shape, tensor_shape.TensorShape):
--> 380         raise ValueError("The dataset passed to 'adapt' must contain a single "
    381                          "tensor value.")
    382       if shape.rank == 0:

ValueError: The dataset passed to 'adapt' must contain a single tensor value.

How can I convert a PrefetchDataset to a BatchDataset?

1 Answer

Answer by anthidp (accepted):

You could use the tf.stack method to pack the features into a single tensor. The function below is taken from the Custom training: walkthrough tutorial in TensorFlow.

def pack_features_vector(features, labels):
  # Stack the values of the feature dict into a single tensor of shape (batch, num_features).
  features = tf.stack(list(features.values()), axis=1)
  return features, labels

all_data = get_dataset(data_path, select_columns=SELECT_COLUMNS)

train_dataset = all_data.map(pack_features_vector)

train_text = train_dataset.map(lambda x, y: x)

vectorize_layer.adapt(train_text)
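
Once adapt has run, you can apply the layer to the packed dataset much like the tutorial does (a minimal sketch, variable names illustrative; after pack_features_vector the text has shape (batch, 1), which TextVectorization accepts):

# Vectorize the packed text column; labels pass through unchanged.
train_ds = train_dataset.map(lambda text, label: (vectorize_layer(text), label))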