We compared the advantages and disadvantages of alignment-based and alignment-free sequence analysis methods. Using the natural vector method, we analyzed and classified the reference sequences of all single-segmented viruses. Natural graphs for each Baltimore group are displayed; they show that different families and genera are clearly separated. For multiple-segmented viruses, we derived the distance matrix by combining the Hausdorff distance with natural vectors. West Nile virus and influenza viruses were included in the dataset, and the natural vector method assigned them to the correct family and genus.
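As an illustration of the two ingredients named above, the following minimal Python sketch computes a 12-dimensional natural vector (count, mean position, and normalized second central moment for each of A, C, G, T) and the Hausdorff distance between two sets of such vectors, where each set stands for the segments of one multiple-segmented genome. The function names `natural_vector` and `hausdorff`, the choice of the 12-dimensional form, and the Euclidean ground distance are assumptions made for this sketch, not details taken from the text.

```python
from math import dist


def natural_vector(seq):
    """Return the 12-dimensional natural vector of a nucleotide sequence.

    Layout (an assumption of this sketch): four counts, then four mean
    positions, then four normalized second central moments, in A, C, G, T order.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in "ACGT":
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        mu_k = sum(positions) / n_k if n_k else 0.0
        # second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n) if n_k else 0.0
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def hausdorff(set_a, set_b):
    """Hausdorff distance between two non-empty sets of equal-length vectors."""

    def directed(xs, ys):
        # largest distance from a point in xs to its nearest point in ys
        return max(min(dist(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))


# Usage: treat each genome as the set of natural vectors of its segments.
genome_1 = [natural_vector(s) for s in ("ACGTAC", "GGCATT")]
genome_2 = [natural_vector(s) for s in ("ACGTTC", "GGCA", "TTAACG")]
d = hausdorff(genome_1, genome_2)
```

Because the Hausdorff distance compares sets of any size, two genomes with different numbers of segments can be compared without padding or pairing segments by hand.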
Based on previous work, we applied natural vectors to Ebola viruses from the 2014 outbreak. The classification accuracy for both family and genus labels reached 100\%. We also display the phylogenetic relationships among EBOV species based on their whole-genome sequences and seven proteins (nucleoprotein (NP), VP35, VP40, glycoprotein (GP), VP30, VP24, and RNA polymerase (L)). The phylogenetic trees indicate that VP24 is the protein most consistent with the variation in virulence, suggesting that VP24 is a pharmaceutical target for treating or preventing Ebola virus infection.
Based on a Markov model, we proposed a new alignment-free sequence analysis method, the Q-vector. It retains sequence length information and reflects the relation between lower-order and higher-order k-mers. When the Q-vector, the k-mer method, and the composition vector were applied to classify the viral reference sequences, the Q-vector showed clear advantages in both effectiveness and accuracy. By combining the distance matrices derived from the Q-vector and the natural vector method, we defined a new distance matrix that minimizes the classification error. Based on this new distance, we display the phylogenetic trees.
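For context on the two baseline methods mentioned above, the sketch below computes a k-mer frequency profile and a composition vector in which each observed k-mer frequency is compared with the frequency predicted from the (k-1)-mers and (k-2)-mers under a Markov assumption. It does not implement the Q-vector, whose definition is not given here. The function names and the handling of non-ACGT characters are assumptions of this sketch.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, k):
    """Relative frequencies of the overlapping k-mers in seq."""
    seq = seq.upper()
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}


def composition_vector(seq, k):
    """Composition vector over all 4**k words on A, C, G, T (requires k >= 3).

    Each entry is (observed - expected) / expected, where the expected
    frequency of a word w is p(w[:-1]) * p(w[1:]) / p(w[1:-1]).
    Entries with zero expected frequency are set to 0.
    """
    if k < 3:
        raise ValueError("k must be at least 3")
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    vec = []
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        middle = p_k2.get(w[1:-1], 0.0)
        if middle:
            expected = p_k1.get(w[:-1], 0.0) * p_k1.get(w[1:], 0.0) / middle
        else:
            expected = 0.0
        observed = p_k.get(w, 0.0)
        vec.append((observed - expected) / expected if expected else 0.0)
    return vec


# Usage: a 64-dimensional composition vector for k = 3.
cv = composition_vector("ACGTACGT", 3)
```

In this sketch, words containing characters other than A, C, G, T still count toward the frequency totals but have no entry in the vector; a real pipeline would need an explicit policy for ambiguous bases.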
We built a virus database called VirusDB (http://mathlab.math.uic.edu/dev for users in the USA; http://r720.math.tsinghua.edu.cn/VirusDB for users in China) and an online system for those interested in virus classification and prediction based on the natural vector method. The database stores the nucleotide sequences, natural vectors, and classification information of the single-segmented and multiple-segmented reference viruses downloaded from NCBI. The online query system computes natural vectors and the distances between sequences, provides backend processes for automatic and manual updating of the database content to keep it synchronized with GenBank, and provides an online interface for accessing and using the database for classification and prediction.