The dataset I am working with is TFDS Cats vs Dogs. To improve model accuracy, I used data augmentation and concatenated the augmented dataset with the original training dataset.
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras

train_ds, test_ds = tfds.load('cats_vs_dogs', split=['train[:80%]', 'train[80%:]'], as_supervised=True)
data_augmentation = keras.Sequential([
keras.layers.RandomHeight(0.1),
keras.layers.RandomWidth(0.1),
keras.layers.RandomFlip('horizontal')
])
@tf.function
def scale_resize_image(image, label):
image = tf.image.resize(image, (200,200))/255.0
return (image, label)
@tf.function
def data_aug(image, label):
return (data_augmentation(image), label)
# Augmented copies of both splits
aug_train_ds = train_ds.map(data_aug)
aug_test_ds = test_ds.map(data_aug)
# Resize to (200, 200) and scale to [0, 1]
aug_train_ds = aug_train_ds.map(scale_resize_image)
aug_test_ds = aug_test_ds.map(scale_resize_image)
train_ds = train_ds.map(scale_resize_image)
test_ds = test_ds.map(scale_resize_image)
# Concatenate the originals with the augmented copies (doubles both splits)
train_ds = train_ds.concatenate(aug_train_ds)
test_ds = test_ds.concatenate(aug_test_ds)
AUTO = tf.data.AUTOTUNE
# Cache the full concatenated dataset in memory, then shuffle, batch, and prefetch
train_ds = train_ds.cache()
train_ds = train_ds.shuffle(10000)
train_ds = train_ds.batch(64)
train_ds = train_ds.prefetch(AUTO)
test_ds = test_ds.batch(64)
test_ds = test_ds.cache()
test_ds = test_ds.prefetch(AUTO)
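As a quick sanity check that the concatenation really doubles the data (a minimal sketch, not part of the training code; it just queries the pipeline's reported cardinality):

# cardinality() reports how many elements the pipeline will yield.
# After concatenation and batching with batch_size=64, I expect roughly
# twice the original number of training batches (582 batches in my run).
print('train batches:', train_ds.cardinality().numpy())
print('test batches:', test_ds.cardinality().numpy())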
All images are resized to (200, 200).
model = keras.models.Sequential([
keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(200, 200, 3)),
keras.layers.MaxPooling2D(2, 2),
keras.layers.Conv2D(64, (3, 3), activation='relu'),
keras.layers.MaxPooling2D(2, 2),
keras.layers.Conv2D(64, (3, 3), activation='relu'),
keras.layers.MaxPooling2D(2, 2),
keras.layers.Conv2D(64, (3, 3), activation='relu'),
keras.layers.MaxPooling2D(2, 2),
keras.layers.Flatten(),
keras.layers.Dense(512, activation='relu'),
keras.layers.BatchNormalization(),
keras.layers.Dense(512, activation='relu'),
keras.layers.Dropout(0.1),
keras.layers.BatchNormalization(),
keras.layers.Dense(512, activation='relu'),
keras.layers.Dropout(0.2),
keras.layers.BatchNormalization(),
keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
optimizer = 'adam',
loss = keras.losses.BinaryCrossentropy(),
metrics = ['accuracy']
)
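To double-check the shapes, I print the model summary (the numbers in the comment are my own rough calculation, so treat them as approximate):

# With 'valid' padding, the 200x200 input shrinks to roughly a 10x10x64
# feature map after the four Conv2D/MaxPooling2D blocks, i.e. about 6400
# units going into the first Dense(512) layer.
model.summary()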
The model is here in case anyone wants to replicate the error I encountered.
model.fit(train_ds, validation_data = test_ds, epochs = 20)
Then I fit train_ds to the model, but after only 169 of 582 batches, Colab's RAM filled up very quickly. Looking at the resources panel in Colab, right after train_ds finished shuffling, batching, and prefetching, RAM usage was only 7.6/12.7 GB; but once training started, it jumped from 7.6 GB to 10.9 GB, and moments later the Colab session crashed.
I have tried reducing the shuffle buffer_size to 1000, but the session still crashed, this time after 360/582 batches. I believe the system can handle a shuffle buffer_size of 10000, since RAM usage is only 7.6 GB once the buffer has filled. I just don't understand why RAM increases so much during training. Is there a mistake in my code? Or is there a way to cache the training batches after they have been fed to the model?
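For context, my rough back-of-the-envelope estimate of what the shuffle buffer alone should hold (a sketch only, assuming the images are float32 after the resize/scale step):

# Rough estimate only: each resized image is 200 x 200 x 3 float32 values.
bytes_per_image = 200 * 200 * 3 * 4               # ~0.48 MB per image
shuffle_buffer_gb = 10_000 * bytes_per_image / 1e9
print(shuffle_buffer_gb)                          # ~4.8 GB for buffer_size=10000

That seems broadly consistent with the 7.6 GB I see once the buffer has filled (plus framework and model overhead), which is why I don't understand what keeps growing after training starts.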
The Colab runtime logs are below:
Feb 17, 2024, 10:35:10 AM WARNING WARNING:root:kernel 15442701-30d0-4330-adea-dda55425f269 restarted
Feb 17, 2024, 10:35:10 AM INFO KernelRestarter: restarting kernel (1/5), keep random ports
Feb 17, 2024, 10:34:35 AM WARNING I0000 00:00:1708137275.262788 4991 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
Feb 17, 2024, 10:34:35 AM WARNING WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
Feb 17, 2024, 10:34:35 AM WARNING 2024-02-17 02:34:35.081162: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
Feb 17, 2024, 10:34:35 AM WARNING 2024-02-17 02:34:35.034986: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
Feb 17, 2024, 10:34:35 AM WARNING 2024-02-17 02:34:35.034943: I external/local_xla/xla/service/service.cc:168] XLA service 0x7be74d6eb3f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
Feb 17, 2024, 10:34:30 AM WARNING 2024-02-17 02:34:30.050414: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
Feb 17, 2024, 10:34:29 AM WARNING 2024-02-17 02:34:29.741309: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:452] Shuffle buffer filled.
Feb 17, 2024, 10:34:26 AM WARNING 2024-02-17 02:34:26.245341: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] ShuffleDatasetV3:23: Filling up shuffle buffer (this may take a while): 7487 of 10000