Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication

Sewell, Andres; Fan, Ke; Shovon, Ahmedur Rahman; Dyken, Landon; Kumar, Sidharth; Petruzza, Steve

doi:10.25417/uic.26170144.v1

conference contribution

posted on 2024-07-08, 20:23 authored by Andres Sewell, Ke Fan, Ahmedur Rahman Shovon, Landon Dyken, Sidharth Kumar, Steve Petruzza

In high-performance computing, collective communication is critical for facilitating comprehensive data exchange involving all processes within an MPI communicator. Due to their inherently global nature, many collective operations present scalability challenges, particularly the all-to-all data shuffle with its quadratic communication pattern. Using a logarithmic communication pattern, the Bruck algorithm was designed to provide communication efficiency for all-to-all data shuffles involving short messages. The Bruck algorithm has been extensively used to facilitate global data shuffles in multi-CPU environments and is also part of the MPICH and Open MPI implementations. This work presents the first investigation of using the Bruck algorithm for all-to-all communication in multi-GPU systems using the NVIDIA Collective Communications Library (NCCL). Our experimental study demonstrates that while the Bruck algorithm exhibits superior performance for small messages in a multi-CPU environment, the same advantages are not evident in multi-GPU environments. Furthermore, we describe an optimized Bruck algorithm implementation in NCCL and compare it with NCCL's default all-to-all and MPI-based implementations. Finally, we discuss the challenges and opportunities of implementing new multi-GPU collectives using NCCL's public-facing API.
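The logarithmic communication pattern mentioned in the abstract can be illustrated with a small single-process simulation. This is a minimal sketch of the textbook Bruck all-to-all (local rotation, log-step exchange, inverse rotation), not the authors' NCCL implementation; the function name `bruck_all_to_all` and the list-of-lists buffer layout are assumptions made for illustration, and no MPI or NCCL calls are involved.

```python
def bruck_all_to_all(send):
    """Simulate a Bruck all-to-all among p ranks in one process.

    send[r][j] is the block that rank r wants delivered to rank j.
    Returns (recv, rounds), where recv[r][s] is the block rank r
    received from rank s, and rounds is the number of exchange steps.
    """
    p = len(send)

    # Phase 1: local rotation. Slot i on rank r now holds the block
    # destined for rank (r + i) % p, i.e. the block that must travel
    # a distance of i ranks.
    buf = [[send[r][(r + i) % p] for i in range(p)] for r in range(p)]

    # Phase 2: in the round with stride k = 1, 2, 4, ..., every rank
    # forwards the slots whose index has bit k set to rank (r + k) % p.
    rounds = 0
    k = 1
    while k < p:
        slots = [i for i in range(p) if i & k]
        # Snapshot outgoing data so all ranks "send" simultaneously.
        outgoing = [[buf[r][i] for i in slots] for r in range(p)]
        for r in range(p):
            src = (r - k) % p
            for slot, block in zip(slots, outgoing[src]):
                buf[r][slot] = block
        k <<= 1
        rounds += 1

    # Phase 3: inverse rotation. Slot i on rank r now holds the block
    # that originated on rank (r - i) % p.
    recv = [[None] * p for _ in range(p)]
    for r in range(p):
        for i in range(p):
            recv[r][(r - i) % p] = buf[r][i]
    return recv, rounds


if __name__ == "__main__":
    p = 5  # a non-power-of-two rank count also works
    send = [[(src, dst) for dst in range(p)] for src in range(p)]
    recv, rounds = bruck_all_to_all(send)
    assert all(recv[d][s] == (s, d) for s in range(p) for d in range(p))
    print(rounds)  # 3 rounds, i.e. ceil(log2(5))
```

Each rank takes part in only ⌈log2 p⌉ exchange rounds instead of p − 1 pairwise exchanges, at the cost of forwarding some blocks more than once, which is why the approach is generally associated with short messages.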

Funding

Collaborative Research: SHF: Small: Scalable and Extensible I/O Runtime and Tools for Next Generation Adaptive Data Layouts | Funder: National Science Foundation | Grant ID: CCF-2401274

Collaborative Research: PPoSS: Large: A Full-stack Approach to Declarative Analytics at Scale | Funder: University of Alabama at Birmingham | Grant ID: 2316157

Citation

Sewell, A., Fan, K., Shovon, A. R., Dyken, L., Kumar, S., & Petruzza, S. (2024, January). Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (pp. 127-133). Association for Computing Machinery (ACM). https://doi.org/10.1145/3635035.3635047

Publisher

Association for Computing Machinery (ACM)

Keywords

46 Information and Computing Sciences
4601 Applied Computing

Licence

In Copyright
