Investigation of Gene Regulation and its Application to Disease Using Machine Learning and Network Models
Thesis posted on 28.06.2013 by Matthew B. Carson
Gene regulation is one of the most important functions in the cell. Changes in the way a gene is regulated can either increase the versatility and adaptability of an organism or be detrimental to it. In this work we study gene regulation from three perspectives. First, we focus on nucleic acid-binding prediction at the protein and residue levels. We predict DNA-binding proteins with 88% accuracy and create a classifier based on C4.5, bootstrap aggregation, and cost-sensitive learning. We obtain balanced sensitivity, specificity, and precision with high accuracy when training and testing on imbalanced residue-level data sets. Second, we focus on DNA-binding sites and two current bioinformatics problems concerning protein binding: transcription factor binding site prediction and DNA methylation prediction. We define a knowledge-based interaction potential for genome-wide binding sites and rank 59% of the true sites for the CRP protein in the top five when compared with random sequences. We also use an alternating decision tree (ADTree) to find highly discriminating rules that differentiate between methylated and non-methylated CpG islands in human DNA. Third, we take a global perspective and study human molecular networks in the context of disease. We examine the number of partnership interactions between transcription factors and how it scales with the number of target genes regulated. In several model organisms and our own generative model, we find that the distribution of the number of partners versus the number of target genes appears to follow an exponential saturation curve. Next, we search for conserved motifs in the transcription factor network and identify the location of disease-related genes within these structures. We find that both cancer and disease genes occupy certain positions more frequently.
We also predict disease genes in the protein-protein interaction network with 79% AUC using ADTree, which identifies important attributes for prediction such as degree and disease neighbor ratio. Finally, we create a co-occurrence matrix for 1854 diseases based on shared gene uniqueness and find previously known and potentially undiscovered relationships. This matrix will be useful for making disease connections that are not obvious and for identifying potential drug targets.
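The two network attributes named above, degree and disease neighbor ratio, can be illustrated with a minimal sketch. The abstract does not define the ratio, so the code assumes it is the fraction of a gene's interaction partners that are already known disease genes. The function names, the toy edge list, and the gene labels are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (gene, gene) interaction pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def node_features(adjacency, disease_genes):
    """Compute degree and disease neighbor ratio for every gene.

    Assumption: disease neighbor ratio = (number of neighbors that are
    known disease genes) / (total number of neighbors).
    """
    features = {}
    for gene, neighbors in adjacency.items():
        degree = len(neighbors)
        disease_neighbors = sum(1 for n in neighbors if n in disease_genes)
        ratio = disease_neighbors / degree if degree else 0.0
        features[gene] = {"degree": degree, "disease_neighbor_ratio": ratio}
    return features


# Hypothetical toy network; labels are placeholders, not real genes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
disease_genes = {"B", "E"}

features = node_features(build_adjacency(edges), disease_genes)
for gene in sorted(features):
    print(gene, features[gene])
```

Feature vectors of this kind could then be passed to any classifier; the thesis reports using ADTree, which this sketch does not implement.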