Posted on 2025-08-01, 00:00. Authored by Adam Haller Ross.
Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to balance workload distribution because of high scheduling overhead, poor adaptability to dynamic workloads, and suboptimal resource utilization. These pitfalls are compounded in heterogeneous systems, where differing computational elements can have vastly different performance profiles. To address these problems, we presented a novel FPGA-based accelerator for stochastic online scheduling (SOS), in which we modified a greedy cost selection assignment policy by adapting existing cost equations to work with discretized time before implementing them in a hardware accelerator design. The proposed design achieved high throughput, low latency, and energy-efficient operation, offering an alternative to traditional software scheduling methods. These properties make the SOS accelerator a strong candidate for deployment in high-performance computing systems, deep learning pipelines, and other performance-critical applications. By introducing a hardware-accelerated approach to real-time scheduling, this work established a new paradigm for adaptive scheduling mechanisms in heterogeneous computing systems.
While that design leveraged hardware parallelism, pre-calculation, and precision quantization to reduce job scheduling latency, the architecture’s operational flow was envisioned from a task-centric perspective. This perspective resulted in a pipelined design in which previously assigned tasks were tracked by multiple decentralized memory elements. Coherency among these discrete memories had to be strictly maintained, which led to several performance bottlenecks, including slow iteration speeds, high interconnect congestion, and heavy resource usage.
In this thesis, I present an alternative architecture rooted in a machine-schedule-centric perspective. Inspired by systolic processing architectures, this new design centralizes the memory elements into connected structures that leverage the ordering imposed by WSPT scheduling both to simplify memory management and to further accelerate scheduling decisions. These enhancements have resulted in an FPGA SOS scheduler design with lower latency and a smaller hardware footprint. Because of these improvements, the new design can also schedule for larger heterogeneous systems: it can synthesize bitstreams for system configurations with 14x more machines than the previous design could. All of these improvements were achieved without raising the power usage of the FPGA performing the scheduling operation above its baseline, with each design size maintaining a power draw of 21 W. By improving the scheduler’s latency, size, and scalability while maintaining overall power draw, this thesis further demonstrates the capability of hardware-based acceleration in heterogeneous, real-time scheduling.
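The interplay between a greedy cost-based assignment and WSPT-ordered machine schedules can be illustrated with a small software sketch. It is not the hardware design or the discretized cost equations from this work. It assumes that WSPT denotes the usual weighted-shortest-processing-time rule (tasks ordered by non-increasing weight-to-processing-time ratio) and that the cost of a candidate machine is the increase in total weighted completion time caused by inserting the task there. The names `insertion_cost` and `assign` are hypothetical, and processing times are treated as expected values in integer ticks.

```python
# Illustrative sketch only: one WSPT-ordered list of (weight, proc_time) per machine.

def insertion_cost(schedule, weight, proc_time):
    """Increase in total weighted completion time if the task joins this machine.

    Assumes `schedule` is kept in WSPT order (non-increasing weight / proc_time).
    """
    ratio = weight / proc_time
    # Tasks with an equal or higher ratio run first and delay the new task.
    ahead = sum(p for w, p in schedule if w / p >= ratio)
    # Tasks with a lower ratio run later and are each delayed by proc_time.
    behind = sum(w for w, p in schedule if w / p < ratio)
    return weight * (ahead + proc_time) + proc_time * behind


def assign(schedules, weight, proc_times):
    """Greedily place a task on the machine with the smallest insertion cost.

    `proc_times[i]` is the task's (expected) processing time on machine i,
    which may differ per machine in a heterogeneous system.
    Returns (machine_index, cost) and updates `schedules` in place.
    """
    costs = [insertion_cost(s, weight, p) for s, p in zip(schedules, proc_times)]
    best = min(range(len(costs)), key=costs.__getitem__)
    p = proc_times[best]
    sched = schedules[best]
    ratio = weight / p
    # Insert before the first task with a strictly lower ratio to keep WSPT order.
    pos = next((i for i, (w, q) in enumerate(sched) if w / q < ratio), len(sched))
    sched.insert(pos, (weight, p))
    return best, costs[best]


if __name__ == "__main__":
    machines = [[], []]
    print(assign(machines, 3, [2, 4]))
    print(assign(machines, 1, [1, 1]))
    print(assign(machines, 4, [1, 5]))
    print(machines)
```

Because each machine's list stays sorted by ratio, the insertion position splits it into the tasks that delay the newcomer and the tasks the newcomer delays. That ordering is what a schedule-centric memory structure can exploit, although the software loop above evaluates machines sequentially rather than in parallel.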