posted on 2024-07-08, 20:23, authored by Andres Sewell, Ke Fan, Ahmedur Rahman Shovon, Landon Dyken, Sidharth Kumar, Steve Petruzza
In high-performance computing, collective communication is critical for data exchange involving all processes within an MPI communicator. Because they are inherently global, many collective operations present scalability challenges, particularly the all-to-all data shuffle with its quadratic communication pattern. The Bruck algorithm uses a logarithmic communication pattern and was designed to make all-to-all data shuffles efficient for short messages. It has been used extensively for global data shuffles in multi-CPU environments and is part of the MPICH and Open MPI implementations. This work presents the first investigation of using the Bruck algorithm for all-to-all communication in multi-GPU systems using the NVIDIA Collective Communications Library (NCCL). Our experimental study demonstrates that while the Bruck algorithm exhibits superior performance for small messages in a multi-CPU environment, the same advantages are not evident in multi-GPU environments. Furthermore, we describe an optimized Bruck algorithm implementation in NCCL and compare it to NCCL's default all-to-all and MPI-based implementations. Finally, we discuss the challenges and opportunities of implementing new multi-GPU collectives using NCCL's public-facing API.
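The abstract does not spell out the steps of the Bruck algorithm, so the following is only a minimal sketch of the classic scheme as it is commonly described: a local rotation, about log2(P) exchange rounds, and a final reordering. It is a single-process Python simulation, not the authors' NCCL or MPI implementation; the function name `bruck_alltoall` and the list-of-lists buffer layout are assumptions made for illustration.

```python
def bruck_alltoall(send):
    """Simulate a Bruck-style all-to-all among P ranks in one process.

    send[p][j] is the block that rank p wants to deliver to rank j.
    Returns (recv, rounds), where recv[p][s] is the block rank p
    received from rank s. Hypothetical helper for illustration only.
    """
    P = len(send)

    # Phase 1: local rotation. Slot i on rank p now holds the block
    # that must travel a distance of i ranks, to rank (p + i) % P.
    buf = [[send[p][(p + i) % P] for i in range(P)] for p in range(P)]

    # Phase 2: one exchange round per bit of the travel distance.
    rounds = 0
    k = 1
    while k < P:
        # Slots whose index has bit k set move k ranks forward.
        slots = [i for i in range(P) if i & k]
        # Snapshot outgoing data so all ranks "send" simultaneously.
        outgoing = [[buf[p][i] for i in slots] for p in range(P)]
        for p in range(P):
            src = (p - k) % P  # rank p receives from rank p - k
            for i, block in zip(slots, outgoing[src]):
                buf[p][i] = block
        k <<= 1
        rounds += 1

    # Phase 3: slot i on rank p came from rank (p - i) % P,
    # so reorder the buffer by source rank.
    recv = [[None] * P for _ in range(P)]
    for p in range(P):
        for i in range(P):
            recv[p][(p - i) % P] = buf[p][i]
    return recv, rounds
```

Calling `bruck_alltoall` with `send[p][j] = (p, j)` for 5 ranks completes in 3 rounds, whereas a direct exchange has each rank contact the other 4 ranks individually. Each round moves roughly half of a rank's blocks, so the scheme trades fewer messages for more data per message, which is why it is usually associated with short messages.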
Funding
Collaborative Research: SHF: Small: Scalable and Extensible I/O Runtime and Tools for Next Generation Adaptive Data Layouts | Funder: National Science Foundation | Grant ID: CCF-2401274
Collaborative Research: PPoSS: Large: A Full-stack Approach to Declarative Analytics at Scale | Funder: University of Alabama at Birmingham | Grant ID: 2316157
Citation
Sewell, A., Fan, K., Shovon, A. R., Dyken, L., Kumar, S., & Petruzza, S. (2024, January). Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (pp. 127-133). Association for Computing Machinery (ACM). https://doi.org/10.1145/3635035.3635047