Horovod on GPU

To use Horovod on GPU, read the options below and see which one applies to you best.

Have GPUs?

In most situations, using NCCL 2 will significantly improve performance over the CPU version. NCCL 2 provides the allreduce operation optimized for NVIDIA GPUs and a variety of networking devices, such as RoCE or InfiniBand.

  1. Install NCCL 2 following these steps.

    If you have installed NCCL 2 using the nccl-<version>.txz package, you should add the library path to the LD_LIBRARY_PATH environment variable or register it in /etc/ld.so.conf.

    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-<version>/lib
    

  2. (Optional) If you're using an NVIDIA Tesla GPU and NIC with GPUDirect RDMA support, you can further speed up NCCL 2 by installing an nv_peer_memory driver.

    GPUDirect allows GPUs to transfer memory directly among each other without CPU involvement, which significantly reduces latency and CPU load. NCCL 2 uses GPUDirect automatically for the allreduce operation if it detects it.
  3. Install Open MPI or another MPI implementation following these steps.

    Note: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

  4. If you installed TensorFlow from PyPI, make sure that g++-5 or above is installed.

    If you installed PyTorch from PyPI, make sure that g++-5 or above is installed.

    If you installed either package from Conda, make sure that the gxx_linux-64 Conda package is installed.

  5. Install the horovod pip package (a quick sanity check that the build includes NCCL support follows these steps).

    If you installed NCCL 2 using the nccl-<version>.txz package, you should specify the path to NCCL 2 using the HOROVOD_NCCL_HOME environment variable.

    $ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
    

    If you installed NCCL 2 using the Ubuntu package, you can run:

    $ HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
    

    If you installed NCCL 2 using the CentOS / RHEL package, you can run:

    $ HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib64 HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
    
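After installation, you can quickly verify that the resulting build actually picked up NCCL by querying Horovod's build flags. This is a minimal sketch: hvd.init(), hvd.rank(), and hvd.size() are the standard Horovod API, while mpi_built and nccl_built live in horovod.common.util and their exact location may vary between Horovod versions.

    # Post-install sanity check (sketch); build-flag helpers may move between versions.
    import horovod.tensorflow as hvd
    from horovod.common.util import mpi_built, nccl_built

    hvd.init()
    print('Rank %d of %d' % (hvd.rank(), hvd.size()))
    print('Built with MPI:  %s' % mpi_built())
    print('Built with NCCL: %s' % nccl_built())

You can run this directly with python on a single process, or under mpirun to confirm that multi-process initialization works.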

Note: Some models with a high computation to communication ratio benefit from doing allreduce on CPU, even if a GPU version is available. To force allreduce to happen on CPU, pass device_dense='/cpu:0' to hvd.DistributedOptimizer:

opt = hvd.DistributedOptimizer(opt, device_dense='/cpu:0')
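
For context, the sketch below shows where that option fits relative to hvd.init() and GPU pinning in TensorFlow 1.x-style graph code. The hvd.init(), hvd.local_rank(), hvd.size(), BroadcastGlobalVariablesHook, and DistributedOptimizer calls are the documented Horovod TensorFlow API; build_model_loss is a hypothetical placeholder for your own model.

    # Sketch: train on GPU, but perform dense-gradient allreduce on the CPU.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Pin each process to a single GPU (one process per GPU).
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    loss = build_model_loss()  # hypothetical placeholder for your model/loss
    opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

    # device_dense='/cpu:0' forces dense allreduce onto the CPU.
    opt = hvd.DistributedOptimizer(opt, device_dense='/cpu:0')
    train_op = opt.minimize(loss)

    hooks = [hvd.BroadcastGlobalVariablesHook(0)]
    with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)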

Advanced: Have a proprietary MPI implementation with GPU support optimized for your network?

This section is only relevant if you have a proprietary MPI implementation with GPU support, i.e. not Open MPI or MPICH. Most users should follow one of the sections above.

If your MPI vendor's implementation of the allreduce operation on GPU is faster than NCCL 2, you can configure Horovod to use it instead:

$ HOROVOD_GPU_ALLREDUCE=MPI pip install --no-cache-dir horovod

Additionally, if your MPI vendor's implementation supports allgather, broadcast, and reducescatter operations on GPU, you can configure Horovod to use them as well:

$ HOROVOD_GPU_OPERATIONS=MPI pip install --no-cache-dir horovod

Note: Allgather allocates an output tensor whose size is proportional to the number of processes participating in the training. If you find yourself running out of GPU memory, you can force allgather to happen on CPU by passing device_sparse='/cpu:0' to hvd.DistributedOptimizer:

opt = hvd.DistributedOptimizer(opt, device_sparse='/cpu:0')
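
To make the trade-off concrete, here is a sketch with made-up embedding dimensions. Gradients of tf.gather (embedding lookups) arrive as sparse IndexedSlices, which Horovod exchanges with allgather, so a large embedding table combined with many workers can exhaust GPU memory unless that exchange is moved to the CPU. Everything except the Horovod calls and standard TensorFlow 1.x ops is illustrative.

    # Sketch: keep sparse-gradient allgather on the CPU; dense allreduce stays on GPU.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Hypothetical 10M x 256 embedding table; its gradient is an IndexedSlices.
    embeddings = tf.get_variable('embeddings', shape=[10000000, 256])
    ids = tf.placeholder(tf.int32, shape=[None])
    looked_up = tf.gather(embeddings, ids)

    loss = tf.reduce_mean(looked_up)  # placeholder objective for illustration
    opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

    # device_sparse='/cpu:0' moves the allgather of sparse gradients to the CPU.
    opt = hvd.DistributedOptimizer(opt, device_sparse='/cpu:0')
    train_op = opt.minimize(loss)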