Hi, I have set up Horovod on a Kubernetes cluster with two GPU nodes using the Spark Operator. I ran the MNIST example (https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/tutorials/training/spark/) with TensorFlow and it completed successfully on both nodes, utilizing the GPUs on both. However, when I use KerasEstimator on Spark, the training finishes successfully but it looks like only one GPU is actually being used.
I am following this example: https://docs.databricks.com/_static/notebooks/deep-learning/horovod-spark-estimator-keras.html
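For context, this is roughly how I am building the estimator, following the notebook above. It is a simplified sketch: `train_df` stands in for my preprocessed Spark DataFrame, the model and the store path are placeholders for my actual fraud-detection pipeline.

```python
import tensorflow as tf
from tensorflow import keras
import horovod.spark.keras as hvd
from horovod.spark.common.store import Store

# Shared location for intermediate data and checkpoints (placeholder path).
store = Store.create('/mnt/shared/horovod-work')

input_dim = 30  # placeholder feature width

# Simple binary classifier standing in for my real model.
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(input_dim,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
optimizer = keras.optimizers.Adam(learning_rate=0.001)

keras_estimator = hvd.KerasEstimator(
    num_proc=2,                    # one training process per GPU node
    store=store,
    model=model,
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy'],
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=64,
    epochs=1,
    verbose=1)

# train_df is my training DataFrame (not shown here).
keras_model = keras_estimator.fit(train_df)
```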
Here are the logs:
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Bootstrap : Using eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/IB : No device found.
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Using network Socket
[1,0]:NCCL version 2.11.4+cuda11.4
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Bootstrap : Using eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/IB : No device found.
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Using network Socket
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all rings
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all trees
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO comm 0x7fd2247488e0 rank 0 nranks 2 cudaDev 0 busId 2000 - Init COMPLETE
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Launch mode Parallel
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all rings
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all trees
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO comm 0x7fad647478a0 rank 1 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
[1,0]:
[1,1]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0086s vs on_train_batch_end time: 0.0658s). Check your callbacks.
[1,0]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0053s vs on_train_batch_end time: 0.0687s). Check your callbacks.
1/4851 [..............................] - ETA: 8:35:39 - loss: 1.0356 - accuracy: 0.4844[1,0]:
9/4851 [..............................] - ETA: 30s - loss: 0.9629 - accuracy: 0.4219 [1,0]:
17/4851 [..............................] - ETA: 31s - loss: 0.9131 - accuracy: 0.4265[1,0]:
24/4851 [..............................] - ETA: 33s - loss: 0.8747 - accuracy: 0.4421[1,0]:
31/4851 [..............................] - ETA: 34s - loss: 0.8364 - accuracy: 0.4768[1,0]:
39/4851 [..............................] - ETA: 34s - loss: 0.7905 - accuracy: 0.5445[1,0]:
48/4851 [..............................] - ETA: 32s - loss: 0.7389 - accuracy: 0.6286[1,0]:
56/4851 [..............................] - ETA: 32s - loss: 0.6957 - accuracy: 0.6816[1,0]:
64/4851 [..............................] - ETA: 32s - loss: 0.6540 - accuracy: 0.7214[1,0]:
71/4851 [..............................] - ETA: 32s - loss: 0.6205 - accuracy: 0.7489[1,0]:
79/4851 [..............................] - ETA: 32s - loss: 0.5844 - accuracy: 0.7743[1,0]:
87/4851 [..............................] - ETA: 32s - loss: 0.5504 - accuracy: 0.7951[1,0]:
95/4851 [..............................] - ETA: 32s - loss: 0.5194 - accuracy: 0.8123[1,0]:
103/4851 [..............................] - ETA: 32s - loss: 0.4912 - accuracy: 0.8269[1,0]:
112/4851 [..............................] - ETA: 31s - loss: 0.4623 - accuracy: 0.8408[1,0]:
121/4851 [..............................] - ETA: 31s - loss: 0.4364 - accuracy: 0.8525[1,0]:
131/4851 [..............................] - ETA: 30s - loss: 0.4106 - accuracy: 0.8637[1,0]:
140/4851 [..............................] - ETA: 30s - loss: 0.3886 - accuracy: 0.8724[1,0]:
148/4851 [..............................] - ETA: 30s - loss: 0.3706 - accuracy: 0.8793[1,0]:
156/4851 [..............................] - ETA: 30s - loss: 0.3542 - accuracy: 0.8855[1,0]:
164/4851 [>.............................] - ETA: 30s - loss: 0.3388 - accuracy: 0.8911[1,0]:
172/4851 [>.............................] - ETA: 30s - loss: 0.3246 - accuracy: 0.8962[1,0]:
180/4851 [>.............................] - ETA: 30s - loss: 0.3116 - accuracy: 0.9008[1,0]:
188/4851 [>.............................] - ETA: 30s - loss: 0.2994 - accuracy: 0.9050[1,0]:
196/4851 [>.............................] - ETA: 30s - loss: 0.2882 - accuracy: 0.9089[1,0]:
204/4851 [>.............................] - ETA: 30s - loss: 0.2778 - accuracy: 0.9125[1,0]:
212/4851 [>.............................] - ETA: 30s - loss: 0.2680 - accuracy: 0.9158[1,0]:
220/4851 [>.............................] - ETA: 30s - loss: 0.2588 - accuracy: 0.9188[1,0]:
227/4851 [>.............................] - ETA: 30s - loss: 0.2513 - accuracy: 0.9213[1,0]:
235/4851 [>.............................] - ETA: 30s - loss: 0.2432 - accuracy: 0.9240[1,0]:
243/4851 [>.............................] - ETA: 30s - loss: 0.2356 - accuracy: 0.9265[1,0]:
251/4851 [>.............................] - ETA: 30s - loss: 0.2285 - accuracy: 0.9288[1,0]:
259/4851 [>.............................] - ETA: 30s - loss: 0.2218 - accuracy: 0.9310[1,0]:
267/4851 [>.............................] - ETA: 30s - loss: 0.2155 - accuracy: 0.9331[1,0]:
275/4851 [>.............................] - ETA: 30s - loss: 0.2095 - accuracy: 0.9351[1,0]:
283/4851 [>.............................] - ETA: 30s - loss: 0.2038 - accuracy: 0.9369[1,0]:
291/4851 [>.............................] - ETA: 30s - loss: 0.1985 - accuracy: 0.9386[1,0]:
299/4851 [>.............................] - ETA: 30s - loss: 0.1933 - accuracy: 0.9403[1,0]:
307/4851 [>.............................] - ETA: 30s - loss: 0.1885 - accuracy: 0.9418[1,0]:
316/4851 [>.............................] - ETA: 30s - loss: 0.1833 - accuracy: 0.9435[1,0]:
325/4851 [=>............................] - ETA: 30s - loss: 0.1784 - accuracy: 0.9450[1,0]:
334/4851 [=>............................] - ETA: 30s - loss: 0.1738 - accuracy: 0.9465[1,0]:
343/4851 [=>............................] - ETA: 30s - loss: 0.1694 - accuracy: 0.9479[1,0]:
351/4851 [=>............................] - ETA: 29s - loss: 0.1656 - accuracy: 0.9491[1,0]:
358/4851 [=>............................] - ETA: 30s - loss: 0.1625 - accuracy: 0.9501[1,0]:
366/4851 [=>............................] - ETA: 29s - loss: 0.1590 - accuracy: 0.9512[1,0]:
374/4851 [=>............................] - ETA: 29s - loss: 0.1557 - accuracy: 0.9522[1,0]:
383/4851 [=>............................] - ETA: 29s - loss: 0.1521 - accuracy: 0.9534[1,0]:
391/4851 [=>............................] - ETA: 29s - loss: 0.1491 - accuracy: 0.9543[1,0]:
400/4851 [=>............................] - ETA: 29s - loss: 0.1458 - accuracy: 0.9554[1,0]:
408/4851 [=>............................] - ETA: 29s - loss: 0.1430 - accuracy: 0.9562[1,0]:
417/4851 [=>............................] - ETA: 29s - loss: 0.1400 - accuracy: 0.9572[1,0]:
422/4851 [=>............................] - ETA: 29s - loss: 0.1384 - accuracy: 0.9577[1,0]:
428/4851 [=>............................] - ETA: 29s - loss: 0.1365 - accuracy: 0.9583[1,0]:
437/4851 [=>............................] - ETA: 29s - loss: 0.1338 - accuracy: 0.9591[1,0]:
447/4851 [=>............................] - ETA: 29s - loss: 0.1314 - accuracy: 0.9600[1,0]:
456/4851 [=>............................] - ETA: 29s - loss: 0.1289 - accuracy: 0.9608[1,0]:
465/4851 [=>............................] - ETA: 29s - loss: 0.1264 - accuracy: 0.9616[1,0]:
474/4851 [=>............................] - ETA: 29s - loss: 0.1241 - accuracy: 0.9623[1,0]:
483/4851 [=>............................] - ETA: 29s - loss: 0.1218 - accuracy: 0.9630[1,0]:
491/4851 [==>...........................] - ETA: 28s - loss: 0.1199 - accuracy: 0.9636[1,0]:
499/4851 [==>...........................] - ETA: 28s - loss: 0.1180 - accuracy: 0.9642[1,0]:
508/4851 [==>...........................] - ETA: 28s - loss: 0.1160 - accuracy: 0.9648[1,0]:
518/4851 [==>...........................] - ETA: 28s - loss: 0.1138 - accuracy: 0.9655[1,0]:
527/4851 [==>...........................] - ETA: 28s - loss: 0.1118 - accuracy: 0.9661[1,0]:
536/4851 [==>...........................] - ETA: 28s - loss: 0.1100 - accuracy: 0.9667[1,0]:
545/4851 [==>...........................] - ETA: 28s - loss: 0.1082 - accuracy: 0.9672[1,0]:
554/4851 [==>...........................] - ETA: 28s - loss: 0.1065 - accuracy: 0.9677[1,0]:
562/4851 [==>...........................] - ETA: 28s - loss: 0.1050 - accuracy: 0.9682[1,0]:
572/4851 [==>...........................] - ETA: 27s - loss: 0.1032 - accuracy: 0.9688[1,0]:
I have tried different Spark and Horovod configurations, but Horovod is still not utilizing the second node's GPU.
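For reference, the Spark GPU scheduling settings I have been experimenting with look roughly like the sketch below (values are representative, not my exact submission; with the Spark Operator I set the same keys under `sparkConf` in the SparkApplication manifest, and the discovery script path depends on the image):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('fraud-engine-application')
         # Two executors with one GPU each, so Horovod can place num_proc=2 workers
         # on different nodes.
         .config('spark.executor.instances', '2')
         .config('spark.executor.resource.gpu.amount', '1')
         .config('spark.task.resource.gpu.amount', '1')
         # GPU discovery script shipped with the Spark distribution; the path
         # inside my image may differ.
         .config('spark.executor.resource.gpu.discoveryScript',
                 '/opt/spark/examples/src/main/scripts/getGpusResources.sh')
         .getOrCreate())
```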