Identifying transcriptional regulatory elements is pivotal to understanding the molecular mechanisms that govern specific expression patterns. Despite the rapid development of next-generation sequencing (NGS) technology for genome-wide profiling of transcription factor (TF) binding, DNA methylation, and other epigenetic features, identifying transcription factor binding sites (TFBSs) with active functional relevance remains a challenging task in systems biology. Computational prediction therefore still plays an important role. However, sequence-based prediction alone cannot represent the dynamics of transcriptional regulation in a cell-specific or condition-specific manner. In addition, the overwhelming number of potential TFBSs obtained through experimental or computational procedures, with an unknown false positive (FP) rate, hinders reliable biological findings. Integrating other types of functional genomics data is therefore crucial for elucidating regulatory mechanisms.
In this thesis, we explore the potential of machine learning methods for distinguishing the most likely causal TFBSs from an enormous number of predicted candidates. We first focus on reconstructing the transcriptional regulatory network (TRN) using TFBSs predicted in the promoter regions of co-expressed genes, under the commonly accepted assumption that genes showing similar expression profiles are likely to be regulated by a common collection of TFs. We propose a penalized multinomial logistic regression model that prioritizes the most representative TFBSs for each set of co-expressed genes, across multiple sets simultaneously. Joint effects of TF interactions are also considered. Cross-validation results show a minimum classification error rate of 0.302 and indicate that the prioritized TFBSs are not obtained by chance. On this basis, we further model gene expression time courses, using the TFBSs predicted from a set of co-expressed genes, to reconstruct the TRN. The expression of a gene at a given time point is modeled as a linear combination of the expression of all non-TF-coding genes and the expression of all TF-coding genes weighted by transformed TFBS binding scores at the previous time point. The evaluation shows high performance in a simulation study (AUC = 0.85). Our model also successfully identifies the TF-coding genes causal for apoptosis in the MCF-7:5C cell line, which is sensitive to E2-induced apoptosis.
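One way to write down the time-course model described above is sketched here. The notation is an assumption introduced for illustration and does not come from the thesis: $x_i(t)$ is the expression of gene $i$ at time point $t$, $\mathcal{G}$ and $\mathcal{T}$ are the sets of non-TF-coding and TF-coding genes, $s_{ik}$ is the predicted binding score of TF $k$ in the promoter of gene $i$, $f$ is the (unspecified) score transformation, $a_{ij}$ and $b_{ik}$ are regression coefficients, and $\varepsilon_i(t)$ is a noise term. The sketch also assumes that both sums use expression values from the previous time point.

```latex
x_i(t) \;=\; \sum_{j \in \mathcal{G}} a_{ij}\, x_j(t-1)
       \;+\; \sum_{k \in \mathcal{T}} b_{ik}\, f(s_{ik})\, x_k(t-1)
       \;+\; \varepsilon_i(t)
```

Under this reading, a TF-coding gene $k$ can contribute to the predicted expression of gene $i$ only to the extent that $f(s_{ik})$ is non-zero, that is, when a binding site for TF $k$ is predicted in the promoter of gene $i$.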
Lastly, we integrate DNA methylation and gene expression data to identify TFBSs located in remote regulatory regions. Previous studies have indicated that low-methylated regions (LMRs) are potential active distal regulatory regions (enhancers) in mammalian genomes. We propose several lasso-penalized logistic regression models to predict the direction of change of differentially expressed (DE) genes using TFBSs predicted in pairwise cell-type-specific LMRs (dLMRs). The models are evaluated on pairs drawn from four cell types. The AUCs from 10-fold cross-validation show that the models using TFBSs in dLMRs in intergenic or gene-body regions have more predictive power (AUC 0.71 and 0.66, respectively) than the model using TFBSs from promoter regions alone (AUC 0.62). The best prediction is obtained when TFBSs in dLMRs from both intergenic and gene-body regions are used together (AUC = 0.78). Our models are able to identify subsets of LMRs in which the binding sites of the insulator protein CTCF, the p300 co-activator, and other TFs previously verified by ChIP-seq are significantly enriched. In summary, our models provide tools for detecting distal and proximal TFBSs that may causally regulate gene expression.
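To make the idea of a lasso-penalized logistic regression concrete, the following is a minimal sketch in plain Python. It is not the thesis implementation: the solver (proximal gradient descent with soft-thresholding), all function names, the synthetic data, and the parameter values are assumptions chosen for illustration. Each row of `X` stands for a DE gene, each column for the presence (1) or absence (0) of a hypothetical TF motif in the dLMRs assigned to that gene, and `y` encodes the direction of change (1 = up, 0 = down). For brevity the sketch scores a single held-out split rather than the 10-fold cross-validation used in the thesis.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink v toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def predict(w, b, xi):
    # Predicted probability that the gene is up-regulated.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))


def fit_lasso_logistic(X, y, lam, lr=0.5, n_iter=300):
    # Minimise mean logistic loss + lam * ||w||_1 by proximal gradient descent.
    # The intercept b is not penalised.
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def auc(y_true, scores):
    # Probability that a random positive outranks a random negative
    # (ties count one half).
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def demo():
    # Synthetic data (an assumption): 6 motifs, only the first two informative.
    rng = random.Random(0)
    X = [[rng.randint(0, 1) for _ in range(6)] for _ in range(200)]
    y = [1 if rng.random() < sigmoid(2.0 * x[0] - 2.0 * x[1]) else 0 for x in X]
    w, b = fit_lasso_logistic(X[:150], y[:150], lam=0.05)
    scores = [predict(w, b, xi) for xi in X[150:]]
    print("weights:", [round(wj, 3) for wj in w])
    print("held-out AUC:", round(auc(y[150:], scores), 3))


if __name__ == "__main__":
    demo()
```

The L1 penalty drives the coefficients of uninformative motifs toward exactly zero, which is what allows such a model to act as a filter over a large set of candidate TFBSs; the motifs that keep non-zero weights are the ones prioritized as potentially regulatory.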