What should I do when I'm getting an mAP of 0.000 using keras-retinanet / resnet50?

I am currently using fizyr/keras-retinanet to train a model that detects 3 classes. When I train the model, I get average precisions of 0.0000 for all of my classes. In some training runs I got slightly higher precisions, e.g. 0.0007.

I have looked at these threads, but their solutions do not seem to work for me: https://github.com/fizyr/keras-retinanet/issues/647 and https://github.com/fizyr/keras-retinanet/issues/1351

Specifically, I added the --image-max-side argument to my training command and set it to 2560 pixels, since the images I am working with are 1920x2560 pixels. The training set has 916 images and the validation set has 258 images.

The full command that I use to train the model is:

python train.py \
    --weights old_snapshots/resnet50_coco_best_v2.h5 \
    --backbone resnet50 \
    --batch-size 1 \
    --image-max-side 2560 \
    --epochs 50 \
    --steps 200 \
    --lr 1e-8 \
    --snapshot-path new_snapshots \
    --tensorboard-dir logs \
    --random-transform \
    csv \
    train.csv \
    classes.csv \
    --val-annotations validation.csv
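
For context, here is a quick back-of-the-envelope check of how much of the training set each epoch covers with these settings (just a sketch; it assumes --steps is the steps-per-epoch value handed to fit_generator, which is how I understand train.py uses it):

# Assumption: --steps is the number of batches per epoch passed to fit_generator.
train_images = 916
batch_size = 1
steps_per_epoch = 200

images_per_epoch = batch_size * steps_per_epoch    # 200
print(images_per_epoch / train_images)             # ~0.22, i.e. each epoch sees ~22% of the set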

I have also tried running the above command without initializing the weights from the COCO snapshot; this produces the same result. Note that I have copied the train.py file into my parent directory.

I had to add the following snippet to train.py so that training is not stopped by the GPU running out of memory:

# Let GPU memory allocation grow on demand instead of pre-allocating all of it
devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)
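
For completeness, a slightly more defensive variant of the same idea (just a sketch: it enables memory growth on every detected GPU and has to run before the first GPU operation, otherwise TensorFlow raises a RuntimeError):

import tensorflow as tf

# Enable memory growth on every visible GPU; memory growth can only be
# toggled before the GPUs are initialized.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)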

Here is a sample from my train.csv file:

dataset/202009/2020-09-18_20-26-16-480016.jpg,645,1178,819,1366,object1
dataset/202009/2020-09-18_20-26-16-480016.jpg,669,1306,1015,1486,object2
dataset/202009/2020-09-14_07-13-59-258711.jpg,,,,,
dataset/202009/2020-09-14_18-58-25-411295.jpg,,,,,
dataset/202009/2020-09-21_20-43-20-525886.jpg,1154,1214,1501,1429,object2
dataset/202009/2020-09-21_20-43-20-525886.jpg,1509,1176,1707,1396,object1
dataset/202009/2020-09-14_19-32-17-116910.jpg,,,,,

Here is my classes.csv file:

object1,0
object2,1
object3,2
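
As a sanity check on the annotations (a minimal sketch, assuming pandas and Pillow are installed; the column names are just my own labels, since the CSV has no header row), I verify that every box lies inside its image, that x2 > x1 and y2 > y1, and that every class name appears in classes.csv:

import pandas as pd
from PIL import Image

# train.csv / classes.csv have no header row, so supply column names explicitly.
annotations = pd.read_csv('train.csv', names=['path', 'x1', 'y1', 'x2', 'y2', 'class'])
classes = pd.read_csv('classes.csv', names=['class', 'id'])
valid_names = set(classes['class'])

for _, row in annotations.iterrows():
    # Rows with empty coordinates are negative (background-only) samples.
    if pd.isna(row['x1']):
        continue
    width, height = Image.open(row['path']).size
    assert 0 <= row['x1'] < row['x2'] <= width, row
    assert 0 <= row['y1'] < row['y2'] <= height, row
    assert row['class'] in valid_names, row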

My installation setup is:

Windows 10
TensorFlow 2.3.1
CUDA Toolkit 11.0
cuDNN v7.6.3

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 451.48       Driver Version: 451.48       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro T2000       WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P8     5W /  N/A |    370MiB /  4096MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

The precision does not change over multiple epochs. Here is a sample of the output:

2020-10-06 13:15:11.249841: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-10-06 13:15:13.280542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-10-06 13:15:13.326726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
...
2020-10-06 13:15:13.419482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Creating model, this may take a second...
2020-10-06 13:15:14.118835: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-06 13:15:14.142741: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21262d9de70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.151023: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-06 13:15:14.157396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
coreClock: 1.785GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-10-06 13:15:14.169198: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
...
2020-10-06 13:15:14.208255: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:14.214516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-06 13:15:14.783905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-06 13:15:14.791223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2020-10-06 13:15:14.797282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2020-10-06 13:15:14.801179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2905 MB memory) -> physical GPU (device: 0, name:
Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-10-06 13:15:14.818767: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2120ce3da40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.825738: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro T2000, Compute Capability 7.5
Model: "retinanet"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, None, None,  0
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, None, None, 6 9408        input_1[0][0]
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, None, None, 6 256         conv1[0][0]
__________________________________________________________________________________________________
conv1_relu (Activation)         (None, None, None, 6 0           bn_conv1[0][0]
__________________________________________________________________________________________________
pool1 (MaxPooling2D)            (None, None, None, 6 0           conv1_relu[0][0]
__________________________________________________________________________________________________
res2a_branch2a (Conv2D)         (None, None, None, 6 4096        pool1[0][0]
__________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizati (None, None, None, 6 256         res2a_branch2a[0][0]
__________________________________________________________________________________________________
res2a_branch2a_relu (Activation (None, None, None, 6 0           bn2a_branch2a[0][0]
__________________________________________________________________________________________________
...

P4_merged (Add)                 (None, None, None, 2 0           P5_upsampled[0][0]
                                                                 C4_reduced[0][0]
__________________________________________________________________________________________________
P4_upsampled (UpsampleLike)     (None, None, None, 2 0           P4_merged[0][0]
                                                                 res3d_relu[0][0]
__________________________________________________________________________________________________
C3_reduced (Conv2D)             (None, None, None, 2 131328      res3d_relu[0][0]
__________________________________________________________________________________________________
P6 (Conv2D)                     (None, None, None, 2 4718848     res5c_relu[0][0]
__________________________________________________________________________________________________
P3_merged (Add)                 (None, None, None, 2 0           P4_upsampled[0][0]
                                                                 C3_reduced[0][0]
__________________________________________________________________________________________________
C6_relu (Activation)            (None, None, None, 2 0           P6[0][0]
__________________________________________________________________________________________________
P3 (Conv2D)                     (None, None, None, 2 590080      P3_merged[0][0]
__________________________________________________________________________________________________
P4 (Conv2D)                     (None, None, None, 2 590080      P4_merged[0][0]
__________________________________________________________________________________________________
P5 (Conv2D)                     (None, None, None, 2 590080      C5_reduced[0][0]
__________________________________________________________________________________________________
P7 (Conv2D)                     (None, None, None, 2 590080      C6_relu[0][0]
__________________________________________________________________________________________________
regression_submodel (Functional (None, None, 4)      2443300     P3[0][0]
                                                                 P4[0][0]
                                                                 P5[0][0]
                                                                 P6[0][0]
                                                                 P7[0][0]
__________________________________________________________________________________________________
classification_submodel (Functi (None, None, 3)      2422555     P3[0][0]
                                                                 P4[0][0]
                                                                 P5[0][0]
                                                                 P6[0][0]
                                                                 P7[0][0]
__________________________________________________________________________________________________
regression (Concatenate)        (None, None, 4)      0           regression_submodel[0][0]
                                                                 regression_submodel[1][0]
                                                                 regression_submodel[2][0]
                                                                 regression_submodel[3][0]
                                                                 regression_submodel[4][0]
__________________________________________________________________________________________________
classification (Concatenate)    (None, None, 3)      0           classification_submodel[0][0]
                                                                 classification_submodel[1][0]
                                                                 classification_submodel[2][0]
                                                                 classification_submodel[3][0]
                                                                 classification_submodel[4][0]
==================================================================================================
Total params: 36,424,447
Trainable params: 36,318,207
Non-trainable params: 106,240
__________________________________________________________________________________________________
None
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2020-10-06 13:15:17.712389: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-06 13:15:17.721698: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-06 13:15:17.749155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cupti64_101.dll
2020-10-06 13:15:17.855545: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
WARNING:tensorflow:From train_latest_fizyr.py:541: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/2
2020-10-06 13:15:25.776950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:28.004983: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-10-06 13:15:28.121843: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-10-06 13:15:29.193580: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.209383: I tensorflow/stream_executor/cuda/cuda_driver.cc:775] failed to allocate 858.70M (900412160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-06 13:15:29.337869: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.363332: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.464090: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
...
2020-10-06 13:15:30.261915: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
  1/200 [..............................] - ETA: 0s - loss: 3.9458 - regression_loss: 2.8127 - classification_loss: 1.13312020-10-06 13:15:31.922292: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
WARNING:tensorflow:From C:\XXXXX\venv38\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
2020-10-06 13:15:32.542621: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
2020-10-06 13:15:32.580191: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 3193 callback api events and 3193 activity events.
2020-10-06 13:15:32.695250: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.734889: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.trace.json.gz
2020-10-06 13:15:32.857585: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.874147: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for memory_profile.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.memory_profile.json.gz
2020-10-06 13:15:32.901109: I tensorflow/python/profiler/internal/profiler_wrapper.cc:111] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32Dumped tool data for xplane.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.xplane.pb
Dumped tool data for overview_page.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.overview_page.pb
Dumped tool data for input_pipeline.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.kernel_stats.pb

  2/200 [..............................] - ETA: 1:41 - loss: 3.8811 - regression_loss: 2.7477 - classification_loss: 1.1334WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0590s vs `on_train_batch_end` time: 0.9618s). Check your callbacks.
Running network: 100% (165 of 165) |#########################################################################################################################################| Elapsed Time: 0:00:42 Time:  0:00:42
Parsing annotations: 100% (165 of 165) |#####################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
100 instances of class object1 with average precision: 0.0000
97 instances of class object2 with average precision: 0.0000
15 instances of class object3 with average precision: 0.0000
mAP: 0.0000

If anyone has suggestions on what to try in order to increase the precision, or on how to troubleshoot why the network isn't detecting any objects, please let me know.
