I am currently using fizyr/keras-retinanet to train a model that detects 3 classes. When I train the model, I get an average precision of 0.0000 for every class. In some training runs I received slightly higher precisions, e.g. 0.0007.
I have looked at these threads, but their solutions don't seem to work for me: https://github.com/fizyr/keras-retinanet/issues/647 and https://github.com/fizyr/keras-retinanet/issues/1351
That is, I added the --image-max-side argument to my training command and set it to 2560 pixels, since the images I am working with are 1920x2560 pixels. The training set is 916 images and the validation set is 258 images.
The full command that I use to train the model is:
python train.py \
--weights old_snapshots/resnet50_coco_best_v2.h5 \
--backbone resnet50 \
--batch-size 1 \
--image-max-side 2560 \
--epochs 50 \
--steps 200 \
--lr 1e-8 \
--snapshot-path new_snapshots \
--tensorboard-dir logs \
--random-transform \
csv \
train.csv \
classes.csv \
--val-annotations validation.csv
I have also tried running the above command without initializing from the COCO weights; this produces the same result. I copied the train.py file into my parent directory so that I could modify it.
I had to include this extra piece of code in train.py so that training would not be stopped by the GPU running out of memory:
import tensorflow as tf

# Let the GPU allocator grow memory on demand instead of reserving it all up front.
devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)
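For completeness, the guarded pattern from the TensorFlow docs (loop over all GPUs and catch the RuntimeError that is raised if they have already been initialized) would be a safer variant; I only have the single Quadro T2000, so the two-line version above is what I actually run:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
try:
    for gpu in gpus:
        # Must run before any GPU is initialized.
        tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as e:
    # set_memory_growth raises RuntimeError once the GPUs are in use.
    print(e)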
Here is a sample from my train.csv file:
dataset/202009/2020-09-18_20-26-16-480016.jpg,645,1178,819,1366,object1
dataset/202009/2020-09-18_20-26-16-480016.jpg,669,1306,1015,1486,object2
dataset/202009/2020-09-14_07-13-59-258711.jpg,,,,,
dataset/202009/2020-09-14_18-58-25-411295.jpg,,,,,
dataset/202009/2020-09-21_20-43-20-525886.jpg,1154,1214,1501,1429,object2
dataset/202009/2020-09-21_20-43-20-525886.jpg,1509,1176,1707,1396,object1
dataset/202009/2020-09-14_19-32-17-116910.jpg,,,,,
Here is my classes.csv file:
object1,0
object2,1
object3,2
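For reference, here is a quick sanity check of the annotations that could rule out malformed boxes or labels (a sketch; it assumes Pillow is installed and that the paths in train.csv resolve from the working directory):
import csv

from PIL import Image

# Class names from classes.csv (one "name,id" pair per line).
with open('classes.csv') as f:
    classes = {row[0] for row in csv.reader(f) if row}

with open('train.csv') as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        path, x1, y1, x2, y2, label = row
        if not x1:  # background-only image: box fields are empty
            continue
        x1, y1, x2, y2 = (int(v) for v in (x1, y1, x2, y2))
        width, height = Image.open(path).size
        # keras-retinanet's CSV generator requires x2 > x1 and y2 > y1,
        # boxes inside the image, and labels that appear in classes.csv.
        assert 0 <= x1 < x2 <= width, (lineno, row)
        assert 0 <= y1 < y2 <= height, (lineno, row)
        assert label in classes, (lineno, label)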
My installation setup is:
Windows 10
TensorFlow 2.3.1
CUDA Toolkit 11.0
cuDNN v7.6.3
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 451.48 Driver Version: 451.48 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro T2000 WDDM | 00000000:01:00.0 On | N/A |
| N/A 50C P8 5W / N/A | 370MiB / 4096MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
The precision does not change over multiple epochs. Here is a sample of the output:
2020-10-06 13:15:11.249841: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-10-06 13:15:13.280542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-10-06 13:15:13.326726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
...
2020-10-06 13:15:13.419482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Creating model, this may take a second...
2020-10-06 13:15:14.118835: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-06 13:15:14.142741: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21262d9de70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.151023: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-06 13:15:14.157396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
coreClock: 1.785GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-10-06 13:15:14.169198: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
...
2020-10-06 13:15:14.208255: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:14.214516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-06 13:15:14.783905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-06 13:15:14.791223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-10-06 13:15:14.797282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-10-06 13:15:14.801179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2905 MB memory) -> physical GPU (device: 0, name:
Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-10-06 13:15:14.818767: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2120ce3da40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.825738: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Quadro T2000, Compute Capability 7.5
Model: "retinanet"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, None, None, 0
__________________________________________________________________________________________________
conv1 (Conv2D) (None, None, None, 6 9408 input_1[0][0]
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization) (None, None, None, 6 256 conv1[0][0]
__________________________________________________________________________________________________
conv1_relu (Activation) (None, None, None, 6 0 bn_conv1[0][0]
__________________________________________________________________________________________________
pool1 (MaxPooling2D) (None, None, None, 6 0 conv1_relu[0][0]
__________________________________________________________________________________________________
res2a_branch2a (Conv2D) (None, None, None, 6 4096 pool1[0][0]
__________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizati (None, None, None, 6 256 res2a_branch2a[0][0]
__________________________________________________________________________________________________
res2a_branch2a_relu (Activation (None, None, None, 6 0 bn2a_branch2a[0][0]
__________________________________________________________________________________________________
...
P4_merged (Add) (None, None, None, 2 0 P5_upsampled[0][0]
C4_reduced[0][0]
__________________________________________________________________________________________________
P4_upsampled (UpsampleLike) (None, None, None, 2 0 P4_merged[0][0]
res3d_relu[0][0]
__________________________________________________________________________________________________
C3_reduced (Conv2D) (None, None, None, 2 131328 res3d_relu[0][0]
__________________________________________________________________________________________________
P6 (Conv2D) (None, None, None, 2 4718848 res5c_relu[0][0]
__________________________________________________________________________________________________
P3_merged (Add) (None, None, None, 2 0 P4_upsampled[0][0]
C3_reduced[0][0]
__________________________________________________________________________________________________
C6_relu (Activation) (None, None, None, 2 0 P6[0][0]
__________________________________________________________________________________________________
P3 (Conv2D) (None, None, None, 2 590080 P3_merged[0][0]
__________________________________________________________________________________________________
P4 (Conv2D) (None, None, None, 2 590080 P4_merged[0][0]
__________________________________________________________________________________________________
P5 (Conv2D) (None, None, None, 2 590080 C5_reduced[0][0]
__________________________________________________________________________________________________
P7 (Conv2D) (None, None, None, 2 590080 C6_relu[0][0]
__________________________________________________________________________________________________
regression_submodel (Functional (None, None, 4) 2443300 P3[0][0]
P4[0][0]
P5[0][0]
P6[0][0]
P7[0][0]
__________________________________________________________________________________________________
classification_submodel (Functi (None, None, 3) 2422555 P3[0][0]
P4[0][0]
P5[0][0]
P6[0][0]
P7[0][0]
__________________________________________________________________________________________________
regression (Concatenate) (None, None, 4) 0 regression_submodel[0][0]
regression_submodel[1][0]
regression_submodel[2][0]
regression_submodel[3][0]
regression_submodel[4][0]
__________________________________________________________________________________________________
classification (Concatenate) (None, None, 3) 0 classification_submodel[0][0]
classification_submodel[1][0]
classification_submodel[2][0]
classification_submodel[3][0]
classification_submodel[4][0]
==================================================================================================
Total params: 36,424,447
Trainable params: 36,318,207
Non-trainable params: 106,240
__________________________________________________________________________________________________
None
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2020-10-06 13:15:17.712389: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-06 13:15:17.721698: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-06 13:15:17.749155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cupti64_101.dll
2020-10-06 13:15:17.855545: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
WARNING:tensorflow:From train_latest_fizyr.py:541: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/2
2020-10-06 13:15:25.776950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:28.004983: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-10-06 13:15:28.121843: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-10-06 13:15:29.193580: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.209383: I tensorflow/stream_executor/cuda/cuda_driver.cc:775] failed to allocate 858.70M (900412160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-06 13:15:29.337869: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.363332: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.464090: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
...
2020-10-06 13:15:30.261915: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
1/200 [..............................] - ETA: 0s - loss: 3.9458 - regression_loss: 2.8127 - classification_loss: 1.1331
2020-10-06 13:15:31.922292: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
WARNING:tensorflow:From C:\XXXXX\venv38\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
2020-10-06 13:15:32.542621: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
2020-10-06 13:15:32.580191: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:223] GpuTracer has collected 3193 callback api events and 3193 activity events.
2020-10-06 13:15:32.695250: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.734889: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.trace.json.gz
2020-10-06 13:15:32.857585: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.874147: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for memory_profile.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.memory_profile.json.gz
2020-10-06 13:15:32.901109: I tensorflow/python/profiler/internal/profiler_wrapper.cc:111] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
Dumped tool data for xplane.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.xplane.pb
Dumped tool data for overview_page.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.overview_page.pb
Dumped tool data for input_pipeline.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.kernel_stats.pb
2/200 [..............................] - ETA: 1:41 - loss: 3.8811 - regression_loss: 2.7477 - classification_loss: 1.1334
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0590s vs `on_train_batch_end` time: 0.9618s). Check your callbacks.
Running network: 100% (165 of 165) |#########################################################################################################################################| Elapsed Time: 0:00:42 Time: 0:00:42
Parsing annotations: 100% (165 of 165) |#####################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00
100 instances of class object1 with average precision: 0.0000
97 instances of class object2 with average precision: 0.0000
15 instances of class object3 with average precision: 0.0000
mAP: 0.0000
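To help troubleshoot, I can also inspect raw predictions from a saved snapshot at a very low score threshold, to see whether the network detects anything at all. A minimal sketch (it assumes a snapshot such as new_snapshots/resnet50_csv_01.h5 exists; the exact filename depends on the epoch):
import numpy as np

from keras_retinanet import models
from keras_retinanet.utils.image import preprocess_image, read_image_bgr, resize_image

# Load a training snapshot and convert it to an inference model
# (this adds the layers that decode anchors and apply NMS).
model = models.load_model('new_snapshots/resnet50_csv_01.h5', backbone_name='resnet50')
model = models.convert_model(model)

image = preprocess_image(read_image_bgr('dataset/202009/2020-09-18_20-26-16-480016.jpg'))
image, scale = resize_image(image, max_side=2560)

boxes, scores, labels = model.predict_on_batch(np.expand_dims(image, axis=0))
boxes /= scale  # map boxes back to the original image's coordinates

# Print everything above a very low threshold; even a weak model
# should usually produce some low-confidence boxes here.
for box, score, label in zip(boxes[0], scores[0], labels[0]):
    if score > 0.05:
        print(label, score, box)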
If anyone has suggestions on what to try to increase the precision, or on how to troubleshoot why the model isn't finding any objects, please let me know.