Does MXNet use Nvidia's NCCL library for multi-GPU communication?


On Nvidia's website, they claim that MXNet uses NCCL (https://developer.nvidia.com/nccl). However, I haven't found any reference in MXNet's GitHub repository indicating that it actually uses the NCCL library.

The Chainer blog also claims that Chainer achieves better performance than MXNet on 4 GPUs because of Chainer's use of the NCCL library (https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html).

In some older posts in the MXNet repository, I can see discussion of the difficulty of including the NCCL library in MXNet.

My first question is: is there any version of MXNet that includes the NCCL library? Second, what might be the performance implications of using NCCL (e.g. lower memory usage, lower communication overhead across multiple GPUs)?

1 Answer

Answer by Chris Olivier:

There is no official release at this time that supports NCCL.

1) There was a PR for this which was closed (see the discussion here: https://github.com/apache/incubator-mxnet/issues/2919). It's possible to pull that code into an older commit.
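A rough sketch of how one might pull in that unmerged work and build against it. This is an assumption-laden example, not an official procedure: the PR number (#5521) comes from the quote below, while the build flags `USE_NCCL` and `NCCL_ROOT` and the NCCL install path are assumptions that would need to be checked against the PR's actual Makefile changes.

```shell
# Sketch only: fetch the NCCL-support PR branch (#5521, per the quote below)
# and build MXNet with it. Flag names and paths are assumptions.
git clone --recursive https://github.com/apache/incubator-mxnet mxnet
cd mxnet
git fetch origin pull/5521/head:nccl-support   # GitHub exposes PRs as refs
git checkout nccl-support
# USE_NCCL / NCCL_ROOT are hypothetical here; verify against the PR's config.mk
make -j"$(nproc)" USE_CUDA=1 USE_NCCL=1 NCCL_ROOT=/usr/local/nccl
```

Since the PR was written against an older tree, expect to resolve conflicts or check out a contemporaneous commit rather than current master.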

2) See this quote from ptrendx@ (Sept 10) about performance with NCCL:

"As part of supporting DGX, NVIDIA provides optimized versions of most major DL frameworks as docker containers. The version of MXNet that is part of this DGX software stack has NCCL support (which I guess is why that page lists MXNet as supported). We do upstream our optimizations and NCCL support is available as a PR since February (#5521), but it is not yet accepted to upstream MXNet due to API required.

That said, MXNet has actually very good communication scheme and as long as your network does not have a very large number of parameters (for which you need bandwidth given by NCCL and NVLink) you may get as good or better results with MXNet's native device kvstore."