Keras crashes when calling model.fit on GPU with large-ish datasets, but without an out-of-memory error


I'm working in Google Colab with TensorFlow 2.3.0 and tf.keras.

I need to run a simple 3D model that nevertheless takes relatively large inputs (batches of 128x128x64x4 images).

If I run this on CPU, I notice that when I call model.fit() the RAM usage increases a lot before dropping back down to the amount I expect (e.g. 1.5 GB used before calling fit, rising to 4 GB, then settling back to 2.3 GB once the actual fit starts).
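In case it's relevant, here is a sketch of how the same arrays could be fed through tf.data instead of raw numpy (just an idea, assuming the RAM spike comes from Keras converting the whole array into tensors internally before training starts; note that from_tensor_slices still materialises one copy of the array as a constant, so a generator-based pipeline would be needed to really avoid the extra copy):

import tensorflow as tf

def make_dataset(x, y, batch_size=2):
    #wrap the arrays in a tf.data pipeline; batching and prefetching happen lazily
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    return ds.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)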

If I run it on GPU, I don't get the GPU OOM error I normally get when my batch size is too large; instead the session simply crashes. I've written a minimal working example that reproduces the crash despite using a very simple network: it has only ~20k trainable parameters, so the gradients take very little memory, and each input sample is only ~16 MB. Yet it crashes if I set batch_size larger than 2!
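For context, my rough back-of-the-envelope estimate of the activation sizes (assuming float32 and that every intermediate feature map is kept for backprop):

voxels = 128 * 128 * 64                                           #spatial size of every feature map in this model
activation_mb = lambda channels: voxels * channels * 4 / 1024**2  #float32 = 4 bytes
#input ~17 MB, Conv3D(32) ~134 MB, Conv3D(8) ~34 MB, Conv3D(64) ~268 MB per sample,
#so a batch of 4 is already a couple of GB of forward activations, before gradients
#and the cuDNN workspace for the 3D convolutions are counted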

Any help? The log doesn't give any information I can interpret easily...

import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,Conv3D, Activation, BatchNormalization, MaxPooling3D, Flatten,Dense,Add, GlobalMaxPooling3D
from tensorflow.keras import optimizers

matrixSide = 128 #dimension of input data
zLeng = 64


#create a very simple model (~20k trainable parameters)
inputL = Input([matrixSide,matrixSide,zLeng,4])
l1 = Conv3D(32,3,activation='relu',padding='same') (inputL)
l1 = BatchNormalization(momentum=0.9)(l1)
l1 = Conv3D(8,1,activation='relu')(l1)
l1 = BatchNormalization(momentum=0.9)(l1)
l1 = Conv3D(64,3,activation='relu',padding='same')(l1)
l1 = Conv3D(1,1,padding='same')(l1)
l1 = Activation('linear')(l1)
modelSeg = Model(inputs= inputL,outputs = l1)

Xtrain = np.random.normal(0,1,size=(32,matrixSide,matrixSide,zLeng,4)).astype('float32') #random data, ~16 MB per sample
Ytrain = Xtrain[:,:,:,:,0] #target: first channel of the input

optim = optimizers.Adam(lr=1e-3)
modelSeg.compile(optimizer=optim,loss='mse')
fitHist = modelSeg.fit(Xtrain,Ytrain,batch_size=4,epochs=4) #works with batch_size=2 or 1
#crashes with batch_size>=4
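In case it helps narrow things down, here is a sketch of running a single training step by hand, so that a crash (if it reproduces) can be tied to the backward pass of the 3D convolutions rather than to fit() itself; train_step is just a name I made up for this example:

import tensorflow as tf

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = modelSeg(x, training=True)                   #forward pass, output shape (batch,128,128,64,1)
        loss = tf.reduce_mean(tf.square(pred[..., 0] - y))  #MSE, as in compile()
    grads = tape.gradient(loss, modelSeg.trainable_variables)  #backward pass through the Conv3D layers
    optim.apply_gradients(zip(grads, modelSeg.trainable_variables))
    return loss

print(train_step(Xtrain[:4], Ytrain[:4]))  #same batch size that crashes in fit()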

Edit: adding the trace log:

Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.406789: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2200000000 Hz
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.350162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.349428: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.348349: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552549: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Sep 28, 2020, 10:59:23 PM   WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
Sep 28, 2020, 10:59:23 PM   WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.551524: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.550156: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.550111: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2bf0f40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.548878: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.407463: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.407435: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2bf0a00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.558673: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.555209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.554467: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.553346: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552924: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552872: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552668: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552642: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552619: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
Sep 28, 2020, 10:59:23 PM   WARNING 2020-09-28 20:59:23.552595: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
Sep 28, 2020, 10:59:39 PM   WARNING 2020-09-28 20:59:39.498888: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
Sep 28, 2020, 10:59:34 PM   WARNING 2020-09-28 20:59:34.398056: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Sep 28, 2020, 10:59:27 PM   WARNING tcmalloc: large alloc 1073741824 bytes == 0x23820000 @ 0x7ff2698ab1e7 0x7ff260da55e1 0x7ff260e09c78 0x7ff260e09f37 0x7ff260ea1f28 0x567193 0x7ff25b2ff939 0x7ff25b2fc877 0x7ff25b2ffacb 0x7ff25b3221e0 0x7ff25b77518d 0x50a7f5 0x50cfd6 0x507f24 0x5165a5 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x507f24 0x509202 0x594b01 0x59fe1e 0x50d596 0x507f24
Sep 28, 2020, 10:59:27 PM   WARNING 2020-09-28 20:59:27.334226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13962 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
Sep 28, 2020, 10:59:27 PM   WARNING 2020-09-28 20:59:27.334170: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Sep 28, 2020, 10:59:27 PM   WARNING 2020-09-28 20:59:27.333295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:27 PM   WARNING 2020-09-28 20:59:27.332306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Sep 28, 2020, 10:59:27 PM   WARNING 2020-09-28 20:59:27.329050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
Sep 28, 2020, 10:59:27 PM   WARNING 2020-09-28 20:59:27.329037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
Sep 28, 2020, 10:59:27 PM   WARNING 2020-09-28 20:59:27.328977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
Sep 28, 2020, 10:59:44 PM   WARNING WARNING:root:kernel 3096a19f-858d-47c6-81aa-4654035a59f2 restarted
Sep 28, 2020, 10:59:44 PM   INFO    KernelRestarter: restarting kernel (1/5), keep random ports
Sep 28, 2020, 10:59:41 PM   WARNING 2020-09-28 20:59:41.705510: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr
Sep 28, 2020, 10:59:41 PM   WARNING 2020-09-28 20:59:41.705499: I tensorflow/stream_executor/stream.cc:1990] [stream=0x22f98140,impl=0x226289c0] did not enqueue 'stop timer': 0x7ff1af7fced0
Sep 28, 2020, 10:59:41 PM   WARNING 2020-09-28 20:59:41.705470: I tensorflow/stream_executor/stream.cc:1978] [stream=0x22f98140,impl=0x226289c0] did not enqueue 'start timer': 0x7ff1af7fced0
Sep 28, 2020, 10:59:41 PM   WARNING 2020-09-28 20:59:41.705460: I tensorflow/stream_executor/stream.cc:322] did not allocate timer: 0x7ff1af7fced0
Sep 28, 2020, 10:59:41 PM   WARNING 2020-09-28 20:59:41.705443: I tensorflow/stream_executor/stream.cc:4977] [stream=0x22f98140,impl=0x226289c0] did not memzero GPU location; source: 0x7ff1af7fcee0
Sep 28, 2020, 10:59:41 PM   WARNING 2020-09-28 20:59:41.705377: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_grad_ops_3d.cc:2013 : Not found: No algorithm worked!
Sep 28, 2020, 10:59:41 PM   WARNING in tensorflow/stream_executor/cuda/cuda_dnn.cc(3316): 'cudnnConvolutionBackwardData( cudnn.handle(), alpha, filter_nd.handle(), filter_data.opaque(), output_nd.handle(), output_data.opaque(), conv.handle(), ToConvBackwardDataAlgo(algorithm_desc), scratch_memory.opaque(), scratch_memory.size(), beta, input_nd.handle(), input_data.opaque())'
Sep 28, 2020, 10:59:41 PM   WARNING 2020-09-28 20:59:41.705133: E tensorflow/stream_executor/dnn.cc:616] CUDNN_STATUS_EXECUTION_FAILED
