University of Illinois at Chicago
ALMASI-DISSERTATION-2023.pdf (5.97 MB)

Latency Optimization in Datacenters using Adaptive Transport and Reliable Training

thesis
posted on 2023-08-01, 00:00 authored by Hamidreza Almasi
Datacenters perform the core computation for a wide range of large-scale applications. These include user-facing online data-intensive services, such as web search, that must meet stringent latency constraints, as well as distributed training of sophisticated DNN models, which is time-consuming. In this dissertation, we first address performance challenges in transport protocols to ensure that data from latency-sensitive applications is transferred efficiently across the datacenter network. We then analyze how different traffic patterns contend for switch buffer resources and propose a scheme that resolves the inherent tension between good burst absorption and high utilization. Finally, we study datacenter failures that impede the progress of distributed training jobs and waste many hours of computing resources. To reduce recovery time, we present a new optimization-based framework for robust gradient aggregation that allows the training task to continue in the presence of failures.

History

Advisor

Vamanan, Balajee

Chair

Vamanan, Balajee

Department

Computer Science

Degree Grantor

University of Illinois at Chicago

Degree Level

  • Doctoral

Degree name

PhD, Doctor of Philosophy

Committee Member

  • Ravi, Sathya N.
  • Eriksson, Jakob
  • Grechanik, Mark
  • Seferoglu, Hulya

Submitted date

August 2023

Thesis type

application/pdf

Language

  • en
