Natural Vector Method: Characterizing, Clustering and Phylogeny of DNA, Genome and Protein sequences
Thesis posted on 15.04.2014, 00:00 by Mo Deng
With the development of biotechnology, more and more biological sequence information has been acquired. The number of sequences in GenBank has grown exponentially over the past 20 years (http://www.ncbi.nlm.nih.gov). The non-redundant (NR) database of protein sequences contains almost 8 million sequences, including the complete genomes of 1800 different species, and this body of data doubles in size every 28 months. Many computational and statistical methods for comparing biological sequences (DNA, genome or protein sequences) have been proposed, and sequence comparison remains one of the most active and important research areas in bioinformatics and computational biology. The two methodologies for studying sequence comparison (i.e., the similarity of sequences) are known as alignment-based and alignment-free methods. Alignment-based methods are widely used by scientists. However, searching for optimal solutions with alignment-based methods is computationally very difficult on large biological databases, especially when comparing three or more biological sequences at a time, i.e., multiple sequence alignment. In fact, multiple sequence alignment is an NP-hard problem. It is therefore necessary to develop alignment-free approaches that overcome these critical limitations of alignment-based methods. In chapter 2, we introduce an alignment-free approach, the natural vector method, which characterizes a biological sequence as a natural vector. We mathematically prove that a sequence and its natural vector are in one-to-one correspondence. This contribution allows us to embed the biological sequence space as a subspace of Euclidean space. More importantly, the natural vector method is much faster and more accurate than state-of-the-art methods.
Therefore, we can globally compare all existing DNA and genome sequences within this space in a very short time, which conventional multiple alignment methods cannot achieve. In addition, the evolutionary properties of an unknown DNA or genome sequence can be predicted in the existing space by simply computing its associated natural vector. We apply our method to genes and genomes from the new outbreak of A (H1N1), human rhinovirus (HRV) genomes, and mammalian mitochondrial genomes. The results indicate that the natural vector method is much faster and more accurate than state-of-the-art methods in specifying the homology of DNA or genome sequences. Secondly, in chapter 3, we use the natural vector method to construct the protein space, a subspace of Euclidean space. Similarly, we prove the one-to-one correspondence between a protein sequence and its natural vector, and we use the natural vector method to reconstruct the phylogenetic tree for protein sequences. As an illustration, protein kinase C (PKC) family and beta globin family sequences are tested based on their 60-dimensional natural vectors. In chapter 4, we introduce a novel method, the ZC-method, to distinguish intron-less sequences from intron-containing sequences. The proposed method extends the Z-curve method and improves its accuracy significantly. In applications, we test the ZC-method and other state-of-the-art methods, including Genscan, N-scan and Z-curve, on the original data of the Z-curve method and on another large dataset. The results show that the proposed ZC-method is more accurate than the others.
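To make the idea of a natural vector concrete, the following is a minimal Python sketch, not the thesis implementation. The abstract does not define the vector, so the sketch assumes a commonly cited form: for each letter k of the alphabet, the count n_k, the mean position mu_k, and the normalized second central moment D2_k = sum((s_i - mu_k)^2) / (n_k * n), where n is the sequence length. The 1-based position convention, the zero values for absent letters, and the names `natural_vector`, `euclidean_distance`, `DNA` and `AMINO_ACIDS` are all assumptions made for illustration. Under these assumptions a DNA sequence maps to 12 dimensions and a protein sequence to 60, which matches the 60-dimensional protein vectors mentioned above.

```python
import math

DNA = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues -> 60 dimensions


def natural_vector(seq, alphabet=DNA):
    """Return [counts..., mean positions..., normalized 2nd central moments...].

    Assumed definition (not taken from the abstract): positions are 1-based,
    and letters that never occur contribute zeros to all three parts.
    """
    n = len(seq)
    # Record the 1-based positions of every letter in the alphabet.
    # Characters outside the alphabet (e.g. 'N') are skipped but still count toward n.
    positions = {k: [] for k in alphabet}
    for i, ch in enumerate(seq.upper(), start=1):
        if ch in positions:
            positions[ch].append(i)

    counts, means, moments = [], [], []
    for k in alphabet:
        pos = positions[k]
        n_k = len(pos)
        mu_k = sum(pos) / n_k if n_k else 0.0
        # Second central moment, normalized by n_k * n.
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n) if n_k else 0.0
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean_distance(u, v):
    """Euclidean distance between two natural vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Usage: compare two short, made-up DNA fragments without aligning them.
a = natural_vector("ACGTACGT")
b = natural_vector("ACGTTGCA")
print(len(a), euclidean_distance(a, b))
```

Because each sequence is reduced to a fixed-length vector in a single pass, pairwise distances between many sequences can be computed without any alignment step. A distance matrix built this way could then be passed to a standard clustering or tree-building routine; the choice of that routine is outside the scope of this sketch.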