University of Illinois Chicago

CPIExtract: A Framework for Collecting and Harmonizing Small Molecule-Protein Interaction Data

thesis
posted on 2025-05-01, 00:00 authored by Andrea Piras
Binding interactions between small molecules (compounds) and proteins are fundamental to cellular function and essential for understanding biological mechanisms. However, data on compound-protein interactions (CPI) are dispersed across multiple databases, each with its own format and curation standards, which makes the information difficult for researchers to use. This work presents CPIExtract, a framework that systematically extracts, filters, and harmonizes CPI data from nine major databases into a single, unified format. By overcoming this data heterogeneity, CPIExtract greatly expands the accessible collection of CPI data, providing more than ten times the annotations available in a single database such as DrugBank. The standardized datasets generated by CPIExtract let researchers streamline their analyses and readily apply the information across diverse biomedical research applications. In particular, CPIExtract's data can help improve machine learning models for drug discovery, such as AI-Bind. Integrating harmonized CPI data into the training of these models improves their generalizability and performance, especially when predicting interactions for understudied compounds and proteins. This work highlights CPIExtract's potential to accelerate the discovery and design of therapeutic agents by supplying robust, comprehensive datasets that bridge the gaps in current CPI databases.

Advisor

Piotr Gmytrasiewicz

Department

Computer Science

Degree Grantor

University of Illinois Chicago

Degree Level

  • Masters

Degree name

MS, Master of Science

Committee Member

  • Zhiling Lan
  • Marco D. Santambrogio

Thesis type

application/pdf

Language

  • en
