Tensorflow/WSL2 GPU out of memory, not using all available?


So I'm trying to fine-tune the medium model on a TITAN RTX (24 GB) in WSL2, but it seems to run out of memory; the small model fits fine. If I boot my computer into a live Ubuntu, I can train both the medium and large models with no issues.

2020-09-23 13:19:36.310992: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23b7a0000 next 260 of size 4194304
2020-09-23 13:19:36.310995: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23bba0000 next 266 of size 16777216
2020-09-23 13:19:36.310998: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23cba0000 next 268 of size 16777216
2020-09-23 13:19:36.311001: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23dba0000 next 270 of size 12582912
2020-09-23 13:19:36.311004: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23e7a0000 next 272 of size 4194304
2020-09-23 13:19:36.311006: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23eba0000 next 278 of size 16777216
2020-09-23 13:19:36.311009: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23fba0000 next 280 of size 16777216
2020-09-23 13:19:36.311012: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x240ba0000 next 282 of size 12582912
2020-09-23 13:19:36.311015: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2417a0000 next 284 of size 4194304
2020-09-23 13:19:36.311020: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x241ba0000 next 290 of size 16777216
2020-09-23 13:19:36.311023: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x242ba0000 next 18446744073709551615 of size 29360128
2020-09-23 13:19:36.311026: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 130543104
2020-09-23 13:19:36.311029: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2447a0000 next 294 of size 12582912
2020-09-23 13:19:36.311032: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2453a0000 next 296 of size 4194304
2020-09-23 13:19:36.311035: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2457a0000 next 302 of size 16777216
2020-09-23 13:19:36.311037: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2467a0000 next 304 of size 16777216
2020-09-23 13:19:36.311040: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2477a0000 next 306 of size 12582912
2020-09-23 13:19:36.311043: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2483a0000 next 308 of size 4194304
2020-09-23 13:19:36.311046: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2487a0000 next 314 of size 16777216
2020-09-23 13:19:36.311049: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2497a0000 next 316 of size 16777216
2020-09-23 13:19:36.311052: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24a7a0000 next 318 of size 12582912
2020-09-23 13:19:36.311055: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24b3a0000 next 320 of size 4194304
2020-09-23 13:19:36.311058: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x24b7a0000 next 18446744073709551615 of size 13102592
2020-09-23 13:19:36.311061: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size: 
2020-09-23 13:19:36.311065: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 98 Chunks of size 256 totalling 24.5KiB
2020-09-23 13:19:36.311069: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 113 Chunks of size 4096 totalling 452.0KiB
2020-09-23 13:19:36.311073: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 12288 totalling 228.0KiB
2020-09-23 13:19:36.311076: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 16384 totalling 288.0KiB
2020-09-23 13:19:36.311079: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 32256 totalling 31.5KiB
2020-09-23 13:19:36.311083: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 4194304 totalling 76.00MiB
2020-09-23 13:19:36.311086: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 12582912 totalling 216.00MiB
2020-09-23 13:19:36.311089: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 36 Chunks of size 16777216 totalling 576.00MiB
2020-09-23 13:19:36.311093: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29360128 totalling 28.00MiB
2020-09-23 13:19:36.311096: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 268435456 totalling 256.00MiB
2020-09-23 13:19:36.311099: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 1.13GiB
2020-09-23 13:19:36.311102: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 1222110720 memory_limit_: 68719476736 available bytes: 67497366016 curr_region_allocation_bytes_: 2147483648
2020-09-23 13:19:36.311108: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 68719476736
InUse:                  1209008128
MaxInUse:               1209008128
NumAllocs:                     762
MaxAllocSize:            268435456

Not sure what to do from here.

There is 1 answer

AudioBubble:

There can be many reasons for OOM issues; below are some of the common causes and workarounds.

  • Make sure you are not running evaluation and training on the same GPU; doing both at once holds on to memory and can cause OOM errors. Try running evaluation on a different GPU.
  • Reducing the batch size will slow down your training, but it will avoid OOM issues.
  • If you have a large dataset, try reducing the input size (for example, by downscaling images) or use the tf.data.Dataset API so batches are streamed instead of held in memory; a sketch of both ideas follows this list.
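Here is a minimal sketch of the batch-size and tf.data.Dataset suggestions above, assuming a generic Keras fine-tuning setup. The synthetic data, layer sizes, and BATCH_SIZE value are placeholders, not taken from the question; swap in your own dataset and model.

```python
import tensorflow as tf

# Optionally pin this process to a single GPU so a separate evaluation
# process can run on another device (skip this if you only have one GPU).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[0], "GPU")

# Placeholder data standing in for the real fine-tuning corpus.
features = tf.random.uniform((1024, 128))
labels = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

BATCH_SIZE = 4  # start small and raise it until memory becomes the limit

# Stream batches with tf.data instead of holding everything in memory at once.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(1024)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Stand-in model; replace with the model you are actually fine-tuning.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dataset, epochs=1)
```

With the pipeline streamed this way, lowering BATCH_SIZE directly caps the per-step activation memory, which is usually the quickest way to get a run to fit on the card.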