All-to-all data exchange, involving bidirectional data transfer among all participating processes, is a fundamental class of data movement in HPC systems. This contains two modalities: uniform (fixed-size) and non-uniform (variable-size) exchange patterns. The significance of these exchange protocols has grown considerably with the advent of exascale computing, as they serve as critical communication primitives for numerous large-scale applications. These include data-driven machine learning, graph mining, fast Fourier transform, quantum computer simulations, and certain advanced preconditioners and solvers. Implementing all-to-all in a manner that scales for different workload sizes, distribution patterns, process counts, and hardware configurations is very challenging.
To facilitate scalability across various configurations, tunability emerges as the principal consideration. The ability to dynamically tune execution parameters for maximum bandwidth utilization while minimizing latency overhead across different input configurations is key to achieving scalability. This requires an analytical focus on two critical metrics: the total number of communication rounds (latency-relevant) and the total volume of data exchanged (bandwidth-relevant). Existing implementations typically rely on algorithms with extremal radix (log-base) values of 2 (latency minimization) and process count (bandwidth maximization). This limits the ability to dynamically optimize performance across the latency–bandwidth spectrum and significantly constrains adaptability to diverse demands. Additionally, current approaches illustrate a fundamental gap between theoretical algorithm design and practical hardware constraints, forfeiting performance gains available through multi-tiered parallelism in modern distributed systems.
This thesis addresses these challenges by designing scalable and tunable algorithms for adaptive all-to-all data exchange, enabling optimal performance across various communication demands. To achieve it, we first formulate a mathematical model that characterizes communication patterns in uniform all-to-all through parameterized expressions quantifying total communication rounds and data exchanged. This model facilitates informed radix selection and effectively balances latency and bandwidth considerations. Subsequently, we develop a tunable radix-based algorithm for uniform all-to-all, which is then extended to non-uniform cases via a novel two-phase communication scheme. We then design hierarchical parameterized algorithms for both uniform and non-uniform all-to-all, strategically leveraging the multi-tiered parallelism of modern HPC architectures. Lastly, we integrate machine learning techniques to autonomously calibrate algorithmic parameters, enabling dynamic and adaptive all-to-all optimization.
History
Advisor
Sidharth Kumar
Department
Computer Science
Degree Grantor
University of Illinois Chicago
Degree Level
Doctoral
Degree name
PhD, Doctor of Philosophy
Committee Member
Michael Papka
Zhiling Lan
Thomas Gilray
Rajeev Thakur
Venkatram Vishwanath