Lawrence Jengar
Jul 15, 2025 06:21
NVIDIA launches NCCL 2.27 to improve AI workloads with faster GPU communication, lower latency, and enhanced resilience, addressing the demands of modern AI infrastructures.
NVIDIA has announced the release of NCCL 2.27, an upgrade to its Collective Communications Library, aimed at significantly enhancing the efficiency of AI workloads by improving GPU communication. This latest version is designed to cater to the growing demands of both training and inference tasks, ensuring fast and reliable operations at scale, according to NVIDIA’s official blog.
Key Performance Enhancements
The NCCL 2.27 release focuses on reducing latency and increasing bandwidth efficiency across GPUs. Key improvements include low-latency kernels with symmetric memory, which optimize collective operations by using buffers with identical virtual addresses. These updates result in a notable reduction in latency, up to 7.6x for small message sizes, making it ideal for real-time inference pipelines.
Another significant feature is the introduction of Direct NIC support, which facilitates full network bandwidth utilization for GPU scale-out communications. This is particularly beneficial for high-throughput inference and training workloads, ensuring networking efficiency without saturating CPU-GPU bandwidth.
New Support for NVLink and InfiniBand SHARP
NCCL 2.27 also introduces support for SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) for NVLink and InfiniBand fabrics. This protocol offloads compute-intensive tasks, enhancing large-scale training by reducing the computational demand on GPUs and improving scalability and performance, especially for large language model (LLM) training.
Resilience with Communicator Shrink
To address the challenges of large-scale distributed training, NCCL 2.27 features the Communicator Shrink function. This allows for dynamic exclusion of failed or unnecessary GPUs, ensuring uninterrupted training processes. It supports both default and error modes for planned reconfigurations and unexpected device failures, respectively.
Enhanced Developer Tools
The update also brings new features for developers, including symmetric memory APIs and enhanced profiling tools. These enhancements provide developers with more precise instrumentation for diagnosing communication performance and optimizing AI workloads.
For more information on NCCL 2.27 and its new capabilities, interested parties can visit the NVIDIA/nccl GitHub repository.
Image source: Shutterstock