Investigation of Gene Regulation and its Application to Disease Using Machine Learning and Network Models

Bookmark or cite this item: http://hdl.handle.net/10027/10061

Files in this item

File: Carson_Matthew.pdf (15MB)
Description: (no description provided)
Format: PDF
Title: Investigation of Gene Regulation and its Application to Disease Using Machine Learning and Network Models
Author(s): Carson, Matthew B.
Advisor(s): Lu, Hui
Contributor(s): Dai, Yang; Liang, Jie; Kibbe, Warren; Jia, Caiyan
Department / Program: Bioengineering
Graduate Major: Bioinformatics
Degree Granting Institution: University of Illinois at Chicago
Degree: PhD, Doctor of Philosophy
Genre: Doctoral
Subject(s): machine learning gene regulation transcription factor network alternating decision tree disease network
Abstract: Gene regulation is one of the most important functions in the cell. Changes in the way a gene is regulated can either increase the versatility and adaptability of an organism or be detrimental to it. In this work we study gene regulation from three perspectives. First, we focus on nucleic acid-binding prediction at the protein and residue levels. We predict DNA-binding proteins with 88% accuracy and create a classifier based on C4.5, bootstrap aggregation, and cost-sensitive learning. We obtain balanced sensitivity, specificity, and precision with high accuracy when training and testing on imbalanced residue-level data sets. Second, we focus on DNA-binding sites and two current bioinformatics problems concerning protein binding: transcription factor binding site prediction and DNA methylation prediction. We define a knowledge-based interaction potential for genome-wide binding sites and rank 59% of the true sites from the CRP protein in the top five when compared with random sequences. We also use an alternating decision tree (ADTree) to find highly discriminating rules that differentiate between methylated and non-methylated CpG islands in human DNA. Third, we take a global perspective and study human molecular networks in the context of disease. We examine the number of partnership interactions between transcription factors and how it scales with the number of target genes regulated. In several model organisms and our own generative model, we find that the relationship between the number of partners and the number of target genes appears to follow an exponential saturation curve. Next, we search for conserved motifs in the transcription factor network and identify the location of disease-related genes within these structures. We find that both cancer and disease genes occupy certain positions more frequently. We also predict disease genes in the protein-protein interaction network with 79% AUC using ADTree, which identifies important attributes for prediction such as degree and disease neighbor ratio. Finally, we create a co-occurrence matrix for 1854 diseases based on shared gene uniqueness and find previously known and potentially undiscovered relationships. This matrix will be useful for making disease connections that are not obvious and for identifying potential drug targets.
Issue Date: 2013-06-28
Genre: thesis
URI: http://hdl.handle.net/10027/10061
Rights Information: Copyright 2013 Matthew B. Carson
Date Available in INDIGO: 2013-06-28; 2015-06-29
Date Deposited: 2013-05
 
