Python CNTK speed comparison of 1-bit SGD with normal SGD on 4 GPUs

I installed CNTK version 2.0.beta7 on an Azure NC24 GPU VM running Ubuntu (Python 3.4). The machine has 4 NVIDIA K80 GPUs. Build info:

            Build type: release
            Build target: GPU
            With 1bit-SGD: yes
            With ASGD: yes
            Math lib: mkl
            CUDA_PATH: /usr/local/cuda-8.0
            CUB_PATH: /usr/local/cub-1.4.1
            CUDNN_PATH: /usr/local
            Build Branch: HEAD
            Build SHA1: 8e8b5ff92eff4647be5d41a5a515956907567126
            Built by Source/CNTK/buildinfo.h$$0 on bbdadbf3455d
            Build Path: /home/philly/jenkins/workspace/CNTK-Build-Linux

I ran the CIFAR-10 ResNet example in distributed mode:

mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 32

Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.018s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.3 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.4 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.8 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.6 samples per second)
...
...
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6300.4 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.2 samples per second)

However, when I run it with 1-bit SGD, I get:

mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 1 -a 50000

...
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.055s (4939.1 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)

As explained here, 1-bit SGD should be faster than regular SGD. Any help is appreciated.
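To put a number on the gap, the final-epoch lines from the two runs can be compared directly. The following is a minimal sketch, not part of the CNTK example: the helper `epoch_stats` and the regular expression are my own, and the two log lines are copied from one worker of each run above.

```python
import re

# Final-epoch lines copied from one worker of each run in the question.
LINE_32BIT = ("Finished Epoch [160]: [Training] loss = 0.037745 * 49664, "
              "metric = 1.2% * 49664 7.883s (6300.4 samples per second)")
LINE_1BIT = ("Finished Epoch [160]: [Training] loss = 0.059290 * 49664, "
             "metric = 2.1% * 49664 10.055s (4939.1 samples per second)")

# Captures the epoch wall time in seconds and the reported throughput.
EPOCH_STATS = re.compile(r"([\d.]+)s \(([\d.]+) samples per second\)")


def epoch_stats(line):
    """Return (seconds, samples_per_second) parsed from one log line."""
    match = EPOCH_STATS.search(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    return float(match.group(1)), float(match.group(2))


secs_32, rate_32 = epoch_stats(LINE_32BIT)
secs_1, rate_1 = epoch_stats(LINE_1BIT)

slowdown = secs_1 / secs_32          # > 1 means the 1-bit run is slower
throughput_ratio = rate_1 / rate_32  # < 1 means the 1-bit run is slower

print("1-bit epoch time / 32-bit epoch time: %.2f" % slowdown)
print("1-bit throughput / 32-bit throughput: %.2f" % throughput_ratio)
```

With the lines above, the 1-bit epoch takes roughly 28% longer than the 32-bit one.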

There is 1 answer

Nikos Karampatziakis (best answer)

1-bit SGD is an effective strategy when the communication time between GPUs is large compared to the computation time for a minibatch.

There are two "issues" with your experiment above: the model you are training has few parameters (not much computation, and little gradient data to exchange), and the 4 GPUs are in the same machine (communication is not that bad compared to, say, going over the network). Also, inside a single machine CNTK uses NVIDIA NCCL, which is much better optimized than the generic MPI implementation that 1-bit SGD uses. Update: at the time of this comment, NCCL is not used by default.
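The trade-off can be illustrated with a toy cost model. This is only a sketch under loud assumptions: every number is hypothetical rather than measured, each gradient value is sent once with no overlap between computation and communication, and 1-bit quantization is assumed to add a small fixed overhead per minibatch. The names `step_time`, `compare`, and `QUANT_OVERHEAD_S` are mine and are not CNTK APIs.

```python
def step_time(compute_s, n_params, bits_per_value, bandwidth_bytes_per_s,
              overhead_s=0.0):
    """Toy per-minibatch time: compute, then send every gradient value once."""
    payload_bytes = n_params * bits_per_value / 8.0
    return compute_s + payload_bytes / bandwidth_bytes_per_s + overhead_s


# All numbers below are hypothetical, chosen only to illustrate the trade-off.
QUANT_OVERHEAD_S = 0.002  # assumed extra cost of quantizing and unpacking


def compare(name, compute_s, n_params, bandwidth_bytes_per_s):
    """Print and return (32-bit time, 1-bit time) for one scenario."""
    full = step_time(compute_s, n_params, 32, bandwidth_bytes_per_s)
    one_bit = step_time(compute_s, n_params, 1, bandwidth_bytes_per_s,
                        QUANT_OVERHEAD_S)
    print("%s: 32-bit %.4fs, 1-bit %.4fs" % (name, full, one_bit))
    return full, one_bit


# Few parameters, fast link inside one machine: transfer time is negligible,
# so the quantization overhead dominates and 1-bit comes out slower.
small_fast = compare("small model, fast link", 0.02, 3e5, 1e10)

# Many parameters, slow network link: transfer time dominates,
# so sending 32x fewer bits pays off.
large_slow = compare("large model, slow link", 0.2, 1e8, 1e8)
```

In this model, 1-bit SGD only wins once the time spent transferring 32-bit gradients exceeds the quantization overhead, which matches the point above about small models on GPUs in a single machine.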