High-throughput technologies, such as microarrays and next-generation sequencing, have accelerated the identification of previously undiscovered biomarkers and the development of novel diagnostic approaches in precision medicine. Meanwhile, now that large numbers of biomarkers can be measured simultaneously in a single experiment, collecting enough biological samples has become the bottleneck in data accumulation. Feature selection is a common strategy for tackling this ‘small n and large p’ scenario. Most current feature selection methods are based purely on statistical theory. However, my experience in analyzing high-throughput data across various projects suggests that biological knowledge can play an important role in feature selection. Therefore, in this dissertation, I present computational investigations of feature selection methods that integrate biological knowledge when dealing with high-dimensional omics data.
First, I present two bioinformatics practices of analyzing high-throughput data in biomedical research: the characterization of the H3K27ac profile across different PM2.5 exposures, and the investigation of batch stability in iPSC technology. Inspired by these experiences, I then design three feature selection methods that integrate biomedical knowledge for high-dimensional omics data analysis. (1) To integrate domain knowledge, I develop SKI, in which two ranks are generated before feature selection: one is based on marginal correlation computed from the omics data in hand, and the other is based on external knowledge provided by domain experts, the literature, or databases. The two ranks are combined into a new rank that is used to prescreen biomarkers, after which a further feature selection approach such as LASSO is performed. In a simulation study, I show that SKI outperforms methods without knowledge integration. I then apply SKI to a gene expression dataset to predict drug response in different cell lines, where SKI achieves higher prediction accuracy than the regular LASSO-based method. (2) To integrate multi-omics data, such as methylation and copy number variants, for survival data analysis, I develop two methods, SKI-Cox and wLASSO-Cox. Cox regression is a common model for survival data analysis. SKI-Cox prescreens genes based on different levels of omics data, and then selects genes in a transcriptome-based Cox regression model. wLASSO-Cox uses the marginal utilities derived from Cox regression models on the other omics data as the penalty factors in a penalized Cox regression on mRNA expression. By simulation, I show that both methods can select more true variables when analyzing omics-based survival data. Both methods also achieve better performance in predicting overall survival time in glioblastoma and lung adenocarcinoma patients using the TCGA dataset.
(3) To integrate pathway or gene set information, my colleagues and I develop a redundancy removable pathway (RRP) based feature selection method for binary and multi-class classification problems. The strategies in (1) and (2) share the limitation of treating genes (features) as independent variables and ignoring the hidden relations among them. Our method uses a greedy algorithm to search, within a specific pathway, for the gene set whose distinguishing power is maximized; pathway activities inferred from the expression of the selected genes are then used in a multi-class K-nearest-neighbor classifier. By testing our method on three sarcoma microarray datasets, we show that it is a robust feature selection method for multi-class classification.
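As an illustration of the prescreening idea in (1), the following is a minimal sketch, not the SKI implementation itself. It ranks features by absolute marginal correlation with the response, ranks them again by an external knowledge score, and keeps the features with the best combined rank. The combination rule (a weighted average of the two ranks), the weight `alpha`, the cutoff `keep`, and all function names are assumptions made for illustration only; the abstract does not specify them. A downstream selector such as LASSO would then be fit on the retained features.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def to_ranks(scores):
    """Convert scores to ranks: rank 1 is the highest score, ties keep feature order."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def prescreen(X, y, knowledge_scores, alpha=0.5, keep=2):
    """Return the sorted indices of the `keep` features with the best combined rank.

    X is a list of samples (rows) by features (columns); y is the response.
    ASSUMPTION: the combined rank is alpha * data_rank + (1 - alpha) * knowledge_rank.
    """
    p = len(X[0])
    # Data-driven rank: absolute marginal correlation of each feature with y.
    marginal = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    data_rank = to_ranks(marginal)
    # Knowledge-driven rank: higher external score means more relevant.
    know_rank = to_ranks(knowledge_scores)
    combined = [alpha * d + (1 - alpha) * k for d, k in zip(data_rank, know_rank)]
    order = sorted(range(p), key=lambda j: combined[j])
    return sorted(order[:keep])
```

Setting `alpha=1.0` in this sketch reduces it to a purely data-driven marginal screen, which makes it easy to compare selections with and without the external knowledge rank.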
Overall, the above studies provide more flexible, knowledge-integrated approaches for selecting biologically relevant features when analyzing high-throughput omics data. Their successful application to real-world datasets demonstrates that close interaction between biologists and statisticians is critical for deciphering the complex biological data generated in biomedical research.