ResourceExhaustedError In Tensorflow BERT Classifier


I am trying to use the BertClassifier from the keras_nlp library, but when I train the model I get the following error:

2024-03-22 22:53:03.932926: W external/local_tsl/tsl/framework/bfc_allocator.cc:487] Allocator (GPU_0_bfc) ran out of memory trying to allocate 192.00MiB (rounded to 201326592)requested by op StatelessRandomUniformV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
2024-03-22 22:53:03.933015: I external/local_tsl/tsl/framework/bfc_allocator.cc:1044] BFCAllocator dump for GPU_0_bfc
2024-03-22 22:53:03.933046: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (256):    Total Chunks: 74, Chunks in use: 74. 18.5KiB allocated for chunks. 18.5KiB in use in bin. 549B client-requested in use in bin.
2024-03-22 22:53:03.933066: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (512):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-22 22:53:03.933086: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (1024):   Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2024-03-22 22:53:03.933105: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (2048):   Total Chunks: 111, Chunks in use: 111. 333.2KiB allocated for chunks. 333.2KiB in use in bin. 333.0KiB client-requested in use in bin.
2024-03-22 22:53:03.933121: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (4096):   Total Chunks: 1, Chunks in use: 1. 6.0KiB allocated for chunks. 6.0KiB in use in bin. 6.0KiB client-requested in use in bin.
...
3.30GiB allocated for chunks. 3.30GiB in use in bin. 3.19GiB client-requested in use in bin.
2024-03-22 22:53:03.933449: I external/local_tsl/tsl/framework/bfc_allocator.cc:1067] Bin for 192.00MiB was 128.00MiB, Chunk State: 
2024-03-22 22:53:03.933468: I external/local_tsl/tsl/framework/bfc_allocator.cc:1080] Next region of size 10344333312
2024-03-22 22:53:03.933495: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 78f6b2000000 of size 256 next 1
2024-03-22 22:53:03.933513: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 78f6b2000100 of size 1280 next 2
2024-03-22 22:53:03.933533: I external/local_tsl/tsl/framework/bfc_allocator.cc:1100] InUse at 78f6b2000600 of size 256 next 3
...
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
Cell In[38], line 1
----> 1 classifer.fit(X_train, y_train, epochs=1, batch_size=128)

File /usr/local/lib/python3.11/dist-packages/keras_nlp/src/utils/pipeline_model.py:188, in PipelineModel.fit(self, x, y, batch_size, sample_weight, validation_data, validation_split, **kwargs)
    181         (vx, vy, vsw) = keras.utils.unpack_x_y_sample_weight(
    182             validation_data
    183         )
    184         validation_data = _convert_inputs_to_dataset(
    185             vx, vy, vsw, batch_size
    186         )
--> 188 return super().fit(
    189     x=x,
    190     y=None,
    191     batch_size=None,
    192     sample_weight=None,
    193     validation_data=validation_data,
    194     **kwargs,
    195 )

File /usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py:123, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    120     filtered_tb = _process_traceback_frames(e.__traceback__)
    121     # To get the full stack trace, call:
    122     # `keras.config.disable_traceback_filtering()`
--> 123     raise e.with_traceback(filtered_tb) from None
    124 finally:
    125     del filtered_tb

File /usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py:123, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    120     filtered_tb = _process_traceback_frames(e.__traceback__)
    121     # To get the full stack trace, call:
    122     # `keras.config.disable_traceback_filtering()`
--> 123     raise e.with_traceback(filtered_tb) from None
    124 finally:
    125     del filtered_tb

ResourceExhaustedError: Exception encountered when calling Dropout.call().

{{function_node __wrapped__StatelessRandomUniformV2_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[128,512,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:StatelessRandomUniformV2] name: 

Arguments received by Dropout.call():
  • inputs=tf.Tensor(shape=(128, 512, 768), dtype=float32)
  • training=True
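
If I am reading the OOM message correctly, the shape (128, 512, 768) is batch_size x sequence_length x hidden_size: 128 is my batch size, 512 is the preset's default sequence length, and 768 is the hidden size of bert_base_en_uncased. This is a sketch of how I checked the sequence length of the bundled preprocessor (assuming the classifier exposes a preprocessor attribute with a sequence_length property, as described in the keras_nlp docs):

# Sketch: inspect the preprocessing length that ships with the preset
# (assumes classifer.preprocessor.sequence_length exists, as in keras_nlp's task models)
print(classifer.preprocessor.sequence_length)  # I expect 512, matching the error shape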

Here is the code I am running:

import keras_nlp

# Build a classifier from the pretrained BERT base (uncased) preset
classifer = keras_nlp.models.BertClassifier.from_preset(
    'bert_base_en_uncased',
    num_classes=num_classes,
    activation='softmax',
)

classifer.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Allocator suggested by the warning in the log above
import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

classifer.fit(X_train, y_train, epochs=1, batch_size=128)

I am running inside a Docker container (tensorflow:latest-gpu-jupyter) on a machine with an NVIDIA RTX 3060 and an Intel Core i5 CPU, with all drivers up to date.
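
For completeness, this is a sketch of how I confirmed that TensorFlow can actually see the GPU inside the container:

import tensorflow as tf
# Should list one physical GPU device for the RTX 3060
print(tf.config.list_physical_devices('GPU'))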

I have tried setting the environment variable, but that didn't help, and I have tried reducing the batch size all the way down to 1 (see the sketch below).
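
This is roughly what the retry looked like (a sketch rather than my exact cells; X_train and y_train are the same arrays as above): I restarted the kernel, set the allocator variable before importing TensorFlow in case the order matters, rebuilt and compiled the classifier exactly as above, and called fit with batch_size=1. It still failed with the same OOM.

import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # set before TensorFlow is imported this time

import keras_nlp
# ... rebuild classifer with BertClassifier.from_preset(...) and compile, exactly as above ...
classifer.fit(X_train, y_train, epochs=1, batch_size=1)  # still raises ResourceExhaustedError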

Any solution or suggestion would be appreciated.
