CPIExtract: A Framework for Collecting and Harmonizing Small Molecule-Protein Interaction Data
thesis
posted on 2025-05-01, 00:00 authored by Andrea Piras
The binding interactions between small molecules (compounds) and proteins are fundamental to cellular functions and essential for understanding biological mechanisms. However, data on compound-protein interactions (CPI) are dispersed across multiple databases, each with its own format and curation standards, which makes the information difficult for researchers to use. This work presents CPIExtract, a framework designed to systematically extract, filter, and harmonize CPI data from nine major databases into a single, unified format. By overcoming data heterogeneity, CPIExtract greatly expands the accessible collection of CPI data, providing over ten times the annotations available in a single database such as DrugBank. The standardized datasets generated by CPIExtract enable researchers to streamline analysis and readily apply the information across diverse biomedical research applications.
In particular, CPIExtract’s data can improve machine learning models for drug discovery, such as AI-Bind. Integrating harmonized CPI data into training improves these models' generalizability and performance, especially when predicting interactions for understudied compounds and proteins. This work highlights CPIExtract’s potential to accelerate the discovery and design of therapeutic agents by supplying robust, comprehensive datasets that bridge the gaps in current CPI databases.
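The extract, filter, and harmonize steps described above can be illustrated with a minimal sketch. This is not CPIExtract's actual code or API: the source names, field names, identifiers, and the `harmonize` function are all hypothetical placeholders, chosen only to show how records with different schemas might be mapped onto one unified format and merged.

```python
from collections import defaultdict

# Hypothetical raw records from two sources; field names differ on purpose.
# All identifiers are placeholders, not real compounds or proteins.
RAW = {
    "source_a": [
        {"compound_key": "CMPD-1", "protein_acc": "PROT-A"},
        {"compound_key": "CMPD-1", "protein_acc": None},  # incomplete record
    ],
    "source_b": [
        {"ligand": "cmpd-1", "target": "PROT-A "},  # same pair, different formatting
        {"ligand": "CMPD-2", "target": "PROT-B"},
    ],
}

# Assumed per-source mapping: (compound field, protein field).
FIELD_MAPS = {
    "source_a": ("compound_key", "protein_acc"),
    "source_b": ("ligand", "target"),
}


def harmonize(raw_by_source, field_maps):
    """Map every source onto (compound, protein) pairs and merge duplicates.

    Returns a dict from (compound, protein) to the set of sources reporting it.
    """
    merged = defaultdict(set)
    for source, rows in raw_by_source.items():
        compound_field, protein_field = field_maps[source]
        for row in rows:
            # Normalize identifiers: trim whitespace and unify letter case.
            compound = (row.get(compound_field) or "").strip().upper()
            protein = (row.get(protein_field) or "").strip().upper()
            # Filter: drop records missing either identifier.
            if not compound or not protein:
                continue
            merged[(compound, protein)].add(source)
    return dict(merged)


unified = harmonize(RAW, FIELD_MAPS)
for (compound, protein), sources in sorted(unified.items()):
    print(compound, protein, sorted(sources))
```

Keeping the set of contributing sources for each pair is one possible design choice: it preserves provenance after merging, so downstream users could still filter interactions by how many databases report them.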