Posted on 2024-07-08, 20:26. Authored by Ke Fan, Steve Petruzza, Thomas Gilray, Sidharth Kumar
MPI_Alltoall is a commonly used collective that exchanges a fixed-size data block between every pair of processes. The function can be implemented through a logarithmic number of point-to-point communication rounds, where the exact number of rounds and the total data exchanged among processes depend on the log base (radix). This paper presents a mathematical foundation for studying all communication patterns for the all-to-all collective by developing parameterized formulas for total communication rounds and data exchanged. The model is used to identify a radix, √P (P: process count), that effectively balances latency and bandwidth concerns, yielding optimal performance, as also confirmed via evaluation on the Theta and Polaris supercomputers at ANL. We also present a novel two-layer tunable-radix algorithm that takes advantage of the shared-memory parallelism offered by modern systems. The algorithm decouples communication rounds into two phases that can be individually optimized for the shared memory and the high-speed interconnect, respectively. Our approach demonstrates improvements of up to 3.8× on Theta and 4.2× on Polaris over the vendor-optimized MPICH-based implementation of MPI_Alltoall for a fast Fourier transform application.
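The dependence of rounds and data volume on the radix can be illustrated with a small counting sketch. It assumes a Bruck-style radix-r scheme in which, for each base-r digit position x and each nonzero digit value z, a process sends in one round every block whose index has digit z at position x. This is an illustrative assumption, not the paper's parameterized formulas, and the function name radix_cost is hypothetical.

```python
def radix_cost(P, r):
    """Count rounds and blocks sent per process for an assumed
    Bruck-style radix-r all-to-all over P processes.

    Assumption (not taken from the paper): in the round for digit
    position x and digit value z (1 <= z < r), a process sends every
    block index i in [0, P) whose x-th base-r digit equals z.
    """
    if P < 2 or r < 2:
        raise ValueError("need P >= 2 and r >= 2")
    rounds = 0
    blocks = 0
    x = 0
    # One digit position per power of r below P.
    while r ** x < P:
        for z in range(1, r):
            # Blocks whose x-th base-r digit is z.
            n = sum(1 for i in range(P) if (i // r ** x) % r == z)
            if n:  # rounds with nothing to send are skipped
                rounds += 1
                blocks += n
        x += 1
    return rounds, blocks


if __name__ == "__main__":
    P = 16
    for r in (2, 4, 16):
        print(r, radix_cost(P, r))
```

Under these assumptions, P = 16 gives (4, 32) for radix 2, (6, 24) for radix 4, and (15, 15) for radix 16: a small radix needs few rounds but moves more blocks, a large radix does the opposite, and radix 4 = √16 sits between the two extremes.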
Funding
Collaborative Research: SHF: Small: Scalable and Extensible I/O Runtime and Tools for Next Generation Adaptive Data Layouts | Funder: National Science Foundation | Grant ID: CCF-2401274
Collaborative Research: PPoSS: Large: A Full-stack Approach to Declarative Analytics at Scale | Funder: University of Alabama at Birmingham | Grant ID: 2316157
Citation
Fan, K., Petruzza, S., Gilray, T., & Kumar, S. (2024, May). Configurable Algorithms for All-to-All Collectives. ISC High Performance 2024 Research Paper Proceedings (39th International Conference) (pp. 1-12). Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.23919/isc.2024.10528936
Publisher
Institute of Electrical and Electronics Engineers (IEEE)