So I'm trying to fine-tune the medium model on a TITAN RTX (24 GB) in WSL2, but it seems to run out of memory. The small model fits. If I boot my computer into a live Ubuntu session, I can train the medium and large models with no issues.
```
2020-09-23 13:19:36.310992: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23b7a0000 next 260 of size 4194304
2020-09-23 13:19:36.310995: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23bba0000 next 266 of size 16777216
2020-09-23 13:19:36.310998: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23cba0000 next 268 of size 16777216
2020-09-23 13:19:36.311001: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23dba0000 next 270 of size 12582912
2020-09-23 13:19:36.311004: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23e7a0000 next 272 of size 4194304
2020-09-23 13:19:36.311006: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23eba0000 next 278 of size 16777216
2020-09-23 13:19:36.311009: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23fba0000 next 280 of size 16777216
2020-09-23 13:19:36.311012: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x240ba0000 next 282 of size 12582912
2020-09-23 13:19:36.311015: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2417a0000 next 284 of size 4194304
2020-09-23 13:19:36.311020: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x241ba0000 next 290 of size 16777216
2020-09-23 13:19:36.311023: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x242ba0000 next 18446744073709551615 of size 29360128
2020-09-23 13:19:36.311026: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 130543104
2020-09-23 13:19:36.311029: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2447a0000 next 294 of size 12582912
2020-09-23 13:19:36.311032: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2453a0000 next 296 of size 4194304
2020-09-23 13:19:36.311035: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2457a0000 next 302 of size 16777216
2020-09-23 13:19:36.311037: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2467a0000 next 304 of size 16777216
2020-09-23 13:19:36.311040: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2477a0000 next 306 of size 12582912
2020-09-23 13:19:36.311043: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2483a0000 next 308 of size 4194304
2020-09-23 13:19:36.311046: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2487a0000 next 314 of size 16777216
2020-09-23 13:19:36.311049: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2497a0000 next 316 of size 16777216
2020-09-23 13:19:36.311052: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24a7a0000 next 318 of size 12582912
2020-09-23 13:19:36.311055: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24b3a0000 next 320 of size 4194304
2020-09-23 13:19:36.311058: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free at 0x24b7a0000 next 18446744073709551615 of size 13102592
2020-09-23 13:19:36.311061: I tensorflow/core/common_runtime/bfc_allocator.cc:914] Summary of in-use Chunks by size:
2020-09-23 13:19:36.311065: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 98 Chunks of size 256 totalling 24.5KiB
2020-09-23 13:19:36.311069: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 113 Chunks of size 4096 totalling 452.0KiB
2020-09-23 13:19:36.311073: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 12288 totalling 228.0KiB
2020-09-23 13:19:36.311076: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 16384 totalling 288.0KiB
2020-09-23 13:19:36.311079: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 32256 totalling 31.5KiB
2020-09-23 13:19:36.311083: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 4194304 totalling 76.00MiB
2020-09-23 13:19:36.311086: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 12582912 totalling 216.00MiB
2020-09-23 13:19:36.311089: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 36 Chunks of size 16777216 totalling 576.00MiB
2020-09-23 13:19:36.311093: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29360128 totalling 28.00MiB
2020-09-23 13:19:36.311096: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 268435456 totalling 256.00MiB
2020-09-23 13:19:36.311099: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 1.13GiB
2020-09-23 13:19:36.311102: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 1222110720 memory_limit_: 68719476736 available bytes: 67497366016 curr_region_allocation_bytes_: 2147483648
2020-09-23 13:19:36.311108: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 68719476736
InUse: 1209008128
MaxInUse: 1209008128
NumAllocs: 762
MaxAllocSize: 268435456
```
Not sure what to do from here...
There can be many reasons for OOM issues; below are some of the common ones and workarounds to fix them.

- Reduce the batch size. This will slow down your training, but it will avoid OOM issues.
- Load your data in the `tf.data.Dataset` format to reduce memory consumption (a short sketch of both workarounds follows below).
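
A minimal sketch of both workarounds, assuming a TF 2.x-style Keras setup. The random token arrays, the tiny stand-in model, and the names used here are placeholders for illustration, not the actual GPT-2 fine-tuning code:

```python
import numpy as np
import tensorflow as tf

# Stand-in data: random token ids shaped like a tokenized fine-tuning corpus.
VOCAB, SEQ_LEN = 50257, 128
inputs = np.random.randint(0, VOCAB, size=(1024, SEQ_LEN), dtype=np.int64)
labels = np.roll(inputs, -1, axis=1)

# Workaround 1: a smaller batch size trades training speed for memory headroom.
BATCH_SIZE = 1

# Workaround 2: stream batches with tf.data.Dataset so only the current batch
# is materialized on the GPU, instead of feeding one large in-memory array.
dataset = (
    tf.data.Dataset.from_tensor_slices((inputs, labels))
    .shuffle(1024)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Tiny placeholder model; substitute your actual fine-tuning model here.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 64),
    tf.keras.layers.Dense(VOCAB),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dataset, epochs=1)
```

If the smaller batch size alone fixes the OOM, you can usually claw back some of the lost throughput with gradient accumulation, but the two changes above are the simplest things to try first.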