Protein Design and Chromatin Structure: Novel Computational Approaches
Thesis posted on 21.06.2016 by Yun Xu
Constructing fitness landscapes has broad implications for molecular evolution, cellular epigenetic states, and protein design. We studied the problem of constructing the fitness landscape for inverse protein folding. Computational inverse protein folding, or protein design, aims to generate amino acid sequences that fold into an a priori determined structural fold, in order to engineer novel or enhanced biochemistry. For this task, a function describing the fitness landscape of sequences is critical for identifying the correct sequences that fold into the desired structure. In this study, we showed that nonlinear fitness functions for protein design can be significantly improved. Using a rectangular kernel with a basis set of proteins and decoys chosen a priori, we obtained a simplified nonlinear kernel function via a finite Newton method. The full landscape for a large number of protein folds can be captured using only 480 native proteins and 3,200 non-protein decoys. A blind test of a simplified version of sequence design was carried out to simultaneously discriminate 428 native sequences, none homologous to any training protein, from 11 million challenging protein-like decoys. The simplified fitness function correctly classified 408 native sequences (20 misclassifications, a 95% correct rate), outperforming several other statistical linear scoring functions and an optimized linear function. Its performance is also comparable to results obtained from a far more complex nonlinear fitness function with > 5,000 terms. Our results further suggest that, for the task of global sequence design of the 428 selected proteins, the search space of protein shape and sequence can be effectively parametrized with a carefully chosen basis set of about 3,680 proteins and decoys. We showed in addition that the overall landscape is not overly sensitive to the specific choice of this set. Our results can be generalized to the construction of fitness landscapes for other problems.
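To make the idea of a kernel fitness function over a fixed basis set concrete, the following is a minimal sketch and not the thesis implementation. It assumes a Gaussian kernel, a toy two-dimensional descriptor, hand-picked weights, and the convention that a positive score means "native-like"; none of these details are stated in the abstract. The kernel matrix is "rectangular" in the sense that its rows are samples and its columns are the (much smaller) basis set. The training step (the finite Newton method) is not shown.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two descriptor vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rectangular_kernel_matrix(samples, basis, sigma=1.0):
    """Kernel matrix with one row per sample and one column per basis element.

    It is rectangular because the basis set is chosen a priori and is
    generally much smaller than the set of samples being scored.
    """
    return [[gaussian_kernel(s, b, sigma) for b in basis] for s in samples]

def fitness(x, basis, weights, bias=0.0, sigma=1.0):
    """Nonlinear fitness score: weighted sum of kernels over the basis set.

    Assumed sign convention: score > 0 means native-like, score < 0 decoy-like.
    """
    return sum(w * gaussian_kernel(x, b, sigma) for w, b in zip(weights, basis)) + bias

# Hypothetical toy data: one native-like and one decoy-like basis point.
basis = [(0.0, 0.0), (3.0, 3.0)]
weights = [1.0, -1.0]  # illustrative values, not trained
samples = [(0.1, 0.0), (2.9, 3.0), (1.5, 1.5)]

K = rectangular_kernel_matrix(samples, basis)
scores = [fitness(s, basis, weights) for s in samples]
```

In a real setting the descriptor would encode a sequence threaded onto a structure, and the weights would come from training against native proteins and decoys.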
Chromosome Conformation Capture (3C)-based technologies are used to detect pairs of loci, located on the same chromosome or on different chromosomes, that are in close spatial proximity. Several biases may affect the 3C-based experimental procedure, including the non-alternative primer design and the distance between restriction sites. To overcome these biases, we propose a novel, general constrained self-avoiding chromatin (C-SAC) model to remove non-specific physical interactions, and we develop a sequential importance sampling algorithm to rebuild 3D chromatin structures from 5C experiments. We apply this approach to the ENCODE region ENm008, the α-globin gene domain on human chromosome 16, for the lymphoblastoid cell (GM12878) and the chronic myelogenous leukemia cell (K562). Using a random ensemble generated by the C-SAC model, we successfully removed non-specific physical interactions from the 5C reads for both cell types. We found that the α-globin gene domain is a compact globule in the GM12878 cell, whereas it forms two separate domains in the K562 cell. We not only recover most of the proximity interactions indicated by 5C, but also find new proximity interactions that 5C experiments cannot detect. In a comparison with ChIA-PET measurements, we obtained 77% coverage of interactions. Based on the ensemble of reconstructed 3D conformations, we also propose a mechanism that may explain why the α-globin gene is inactive in the GM12878 cell and active in the K562 cell.
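As a rough illustration of sequential importance sampling for self-avoiding chains, the sketch below grows walks on a 2D square lattice and carries an importance weight for each chain. This is a deliberately simplified stand-in: the lattice, the function names, and the observable are assumptions for illustration only, and the C-SAC model described above additionally involves chromatin-specific constraints and 5C data that are not modeled here.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def grow_chain(n_steps, rng):
    """Grow one self-avoiding chain step by step and return (chain, weight).

    At each step the next site is drawn uniformly from the unoccupied
    neighbours, and the weight is multiplied by the number of such
    neighbours to correct for the non-uniform growth. A chain that gets
    trapped is returned early with weight 0.
    """
    chain = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(n_steps):
        x, y = chain[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return chain, 0.0
        weight *= len(free)
        nxt = rng.choice(free)
        chain.append(nxt)
        occupied.add(nxt)
    return chain, weight

def squared_end_to_end(chain):
    """Squared distance between the first and last site of a chain."""
    (x0, y0), (x1, y1) = chain[0], chain[-1]
    return (x1 - x0) ** 2 + (y1 - y0) ** 2

def weighted_mean(samples, observable):
    """Importance-weighted ensemble average of an observable."""
    total = sum(w for _, w in samples)
    return sum(w * observable(c) for c, w in samples) / total

rng = random.Random(0)
ensemble = [grow_chain(20, rng) for _ in range(2000)]
mean_r2 = weighted_mean(ensemble, squared_end_to_end)
```

The weighted ensemble plays the role of a random reference: properties computed over it (here the squared end-to-end distance) reflect what chain connectivity and self-avoidance alone would produce.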