Colab RAM fills up moments after training starts


The dataset I am working with is TFDS Cats vs Dogs. To improve model accuracy, I applied data augmentation and concatenated the augmented dataset onto the original training dataset.

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras

train_ds, test_ds = tfds.load('cats_vs_dogs', split = ['train[:80%]', 'train[80%:]'], as_supervised = True)

data_augmentation = keras.Sequential([
    keras.layers.RandomHeight(0.1),
    keras.layers.RandomWidth(0.1),
    keras.layers.RandomFlip('horizontal')
])


@tf.function
def scale_resize_image(image, label):
    image = tf.image.resize(image, (200,200))/255.0
    return (image, label)


@tf.function
def data_aug(image, label):
    return (data_augmentation(image), label)


aug_train_ds = (train_ds.map(data_aug))
aug_test_ds = (test_ds.map(data_aug))

aug_train_ds = (aug_train_ds.map(scale_resize_image))
aug_test_ds = (aug_test_ds.map(scale_resize_image))

train_ds = (train_ds.map(scale_resize_image))
test_ds = (test_ds.map(scale_resize_image))

train_ds = train_ds.concatenate(aug_train_ds)
test_ds = test_ds.concatenate(aug_test_ds)

AUTO = tf.data.AUTOTUNE

train_ds = train_ds.cache()
train_ds = train_ds.shuffle(10000)
train_ds = train_ds.batch(64)
train_ds = train_ds.prefetch(AUTO)

test_ds = test_ds.batch(64)
test_ds = test_ds.cache()
test_ds = test_ds.prefetch(AUTO)

All images are resized to (200, 200).

model = keras.models.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(200, 200, 3)),
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D(2, 2),
 
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dropout(0.1),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer = 'adam',
    loss = keras.losses.BinaryCrossentropy(),
    metrics = ['accuracy']
)

The model is included here in case anyone wants to reproduce the error I encountered.

model.fit(train_ds, validation_data = test_ds, epochs = 20)

Then I fit the model on train_ds, but after only 169 of 582 batches Colab's RAM filled up very quickly. According to Colab's resource monitor, RAM usage was only 7.6/12.7 GB once train_ds had finished shuffling, batching and prefetching. When training started, it jumped from 7.6 GB to 10.9 GB, and moments later the Colab session crashed.

I tried reducing the shuffle buffer_size to 1000, but the session still crashed after 360/582 batches. I believe the system can handle a shuffle buffer_size of 10000, since RAM usage was only 7.6 GB after shuffling. I just don't understand why RAM usage increases so much during training. Is there a mistake in my code? Or is there a way to cache the training batches after they have been fed into the model?
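For reference, here is a rough back-of-the-envelope estimate of how much memory these tensors could need. It is only a sketch, and it assumes that every cached or buffered element is a 200x200x3 float32 tensor (the default bilinear tf.image.resize returns float32) and that cache() keeps the whole dataset in RAM. The constant and function names are made up for illustration. The batch count, batch size and RAM limit are the figures quoted above.

```python
# Rough RAM estimate for the image tensors in the input pipeline.
# Assumptions (not measured): float32 pixels, 3 channels, labels ignored,
# and no per-tensor overhead from TensorFlow itself.

BYTES_PER_FLOAT32 = 4
HEIGHT, WIDTH, CHANNELS = 200, 200, 3
BATCHES_PER_EPOCH = 582   # reported by model.fit in the question
BATCH_SIZE = 64
COLAB_RAM_GB = 12.7       # limit shown by Colab's resource monitor

bytes_per_image = HEIGHT * WIDTH * CHANNELS * BYTES_PER_FLOAT32


def images_to_gb(n_images):
    """Size in GB (10**9 bytes) of n_images resized float32 images."""
    return n_images * bytes_per_image / 1e9


# Upper bound on the number of images: the last batch may be partial.
max_images = BATCHES_PER_EPOCH * BATCH_SIZE

full_cache_gb = images_to_gb(max_images)
shuffle_10000_gb = images_to_gb(10_000)
shuffle_1000_gb = images_to_gb(1_000)

print(f"bytes per image:            {bytes_per_image}")
print(f"whole dataset cached:       {full_cache_gb:.2f} GB")
print(f"shuffle buffer of 10000:    {shuffle_10000_gb:.2f} GB")
print(f"shuffle buffer of 1000:     {shuffle_1000_gb:.2f} GB")
print(f"fits in Colab RAM if fully cached: {full_cache_gb < COLAB_RAM_GB}")
```

Under these assumptions, a fully populated in-memory cache of the concatenated training set would need about 17.9 GB, which is more than the 12.7 GB available. The shuffle buffer by itself is comparatively small (about 4.8 GB at 10000 elements and 0.48 GB at 1000). That would be consistent with RAM usage continuing to grow during the first epoch regardless of buffer_size, though the estimate alone does not prove that the cache is the cause.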

The Colab runtime logs are below (newest entries first):

Feb 17, 2024, 10:35:10 AM   WARNING WARNING:root:kernel 15442701-30d0-4330-adea-dda55425f269 restarted
Feb 17, 2024, 10:35:10 AM   INFO    KernelRestarter: restarting kernel (1/5), keep random ports
Feb 17, 2024, 10:34:35 AM   WARNING I0000 00:00:1708137275.262788 4991 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
Feb 17, 2024, 10:34:35 AM   WARNING WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
Feb 17, 2024, 10:34:35 AM   WARNING 2024-02-17 02:34:35.081162: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
Feb 17, 2024, 10:34:35 AM   WARNING 2024-02-17 02:34:35.034986: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
Feb 17, 2024, 10:34:35 AM   WARNING 2024-02-17 02:34:35.034943: I external/local_xla/xla/service/service.cc:168] XLA service 0x7be74d6eb3f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
Feb 17, 2024, 10:34:30 AM   WARNING 2024-02-17 02:34:30.050414: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
Feb 17, 2024, 10:34:29 AM   WARNING 2024-02-17 02:34:29.741309: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:452] Shuffle buffer filled.
Feb 17, 2024, 10:34:26 AM   WARNING 2024-02-17 02:34:26.245341: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] ShuffleDatasetV3:23: Filling up shuffle buffer (this may take a while): 7487 of 10000