University of Illinois Chicago

MScompress: A Versatile Compression Tool for Efficient Storage of Mass-Spectrometry Data

thesis
posted on 2024-05-01, 00:00 authored by Christopher Grams
Mass spectrometry (MS) is an indispensable tool for high-throughput identification and quantification of proteins. In recent years, improvements in instrumentation and computational hardware have significantly increased both the quality and the quantity of mass spectrometry data. Public data repositories such as PRIDE and MassIVE have been established by the National Institutes of Health (NIH) and the European Bioinformatics Institute (EBI) to ensure that all data adhere to the FAIR principles (findable, accessible, interoperable, and reusable). However, the rapid expansion of MS data volume in public repositories calls for more sophisticated solutions for data packing and compression. Prior work has mostly focused on reducing storage costs by improving compression ratios, with little concern for the resulting increases in processing time and energy consumption. With recent advances in multi-core processors and high-speed storage, a modern approach to MS data compression is needed to take advantage of current computer hardware. We present MScompress, a novel platform-independent, multi-threaded MS data compression solution that achieves state-of-the-art compression and decompression speeds by exploiting prior knowledge of the structure of mass spectrometry data and of the standard mzML file format. Our proposed MScompress file format (msz) offers both lossless and lossy compression options for MS binary data, while preserving a compressed yet lossless representation of the mzML file's XML structure. Furthermore, it allows individual spectra to be queried directly from the compressed msz file, ensuring seamless future integration into existing processing pipelines. In our initial tests, MScompress is 5-30x faster in both compression and decompression than existing solutions while achieving similar or better compression ratios. MScompress is available as a command-line interface written in C, along with a native Node.js addon built on Node-API. Additionally, a front-end GUI built with the Electron framework is provided to allow for easy pipeline integration of the novel file format.

Advisor

Michael Papka

Department

Computer Science

Degree Grantor

University of Illinois Chicago

Degree Level

  • Masters

Degree name

MS, Master of Science

Committee Member

Yu Gao, Zhiling Lan, Sidharth Kumar

Thesis type

application/pdf

Language

  • en
