Posted on 2013-06-28, 00:00. Authored by Troy A. Hernandez.
The process of transforming a sample to a pair of input and output vectors is sometimes referred to as "vectorization". Those samples and their respective vectorizations are used within various learning algorithms to create a model that makes predictions about unknown output vectors given known input vectors. This thesis aims to compare, generalize, and improve existing vectorizations within the fields of bioinformatics and transportation.
We extend the natural vector description of genomes to handle viruses and various issues unique to viral genomes. We provide an alternative definition of the natural vector that is able to handle ambiguous nucleotides. We provide a bound on the distance induced by the natural vector between a genome and a mutation of that genome due to a single-nucleotide polymorphism.
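As an illustration of the kind of vectorization discussed above, the following is a minimal sketch of the commonly used 12-dimensional natural vector: for each nucleotide, its count, its mean position, and its normalized second central moment of position. The function name `natural_vector` is hypothetical, positions are assumed to be 1-based, and characters outside A, C, G, T are simply ignored. That is a simplification and is not the alternative definition for ambiguous nucleotides proposed in the thesis.

```python
def natural_vector(seq):
    """Sketch of a 12-dimensional natural vector for a nucleotide string.

    Returns [n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T].
    Assumption: characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in "ACGT":
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Convention assumed here: absent nucleotides contribute zeros
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments
```

Two genomes can then be compared by, for example, the Euclidean distance between their vectors, which is one possible choice of induced distance.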
We then present a new family of alignment-free vectorizations. This family uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We provide a comparison of five popular characterizations of genome similarity using k-nearest neighbor classification, and evaluate these on two collections of viruses.
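The idea of combining k-mer frequencies with positional statistics can be sketched as follows. This is only an illustrative assumption about what one member of such a family might look like, not the thesis's definition: the function name `kmer_position_features`, the choice of mean and variance as the descriptive statistics, and the use of 1-based start positions are all hypothetical.

```python
from collections import defaultdict


def kmer_position_features(seq, k=2):
    """Map each observed k-mer to (relative frequency, mean position, variance).

    Assumption: positions are the 1-based start indices of each k-mer,
    and the variance is the population variance of those positions.
    """
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)
    total = sum(len(p) for p in positions.values())
    features = {}
    for word, pos in positions.items():
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        features[word] = (len(pos) / total, mean, var)
    return features
```

For a fixed k, ordering the k-mers alphabetically and filling zeros for absent words would give a fixed-length vector suitable for a k-nearest neighbor classifier.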
The prediction of bus arrival times is important for users of public transportation. We first generalize existing vectorizations and representations. We then propose a method of recovering the schedule and show, using 3 weeks of Chicago Transit Authority bus data, that the use of this schedule uniformly improves all existing methods.
Lastly, we analyze data usage from reporting real-time GPS traces. The problem of tracking a GPS device relies upon predicting vehicle location in general, as opposed to predicting vehicle location on fixed routes as above. A comparison of 12 different tracking methods is performed on two data sets. We show that at low error tolerances the methods are equivalent, but at higher error tolerances the proposed method is substantially more efficient.