Geometric Approximation Algorithms for Metric Learning and Poisoning
Thesis, posted on 2023-05-01, authored by Diego Hisashi Ihara Centurion
Machine learning tasks such as nearest-neighbor search and clustering depend on a good representation of the input data. A good representation is one in which distance quantifies dissimilarity between pairs of data points. Metric learning aims to find such a representation by mapping the input data to points in some metric space. However, the input data can be contaminated by adversarial data, so a desirable property of machine learning algorithms is robustness in the presence of such data. Several algorithms that are robust to small perturbations of the input have been proposed, but it is generally not clear how to provably compare their robustness.
In this dissertation, we present approximation algorithms, with theoretical guarantees, for learning Mahalanobis metric spaces even when a small fraction of the data set is adversarial. To that end, we consider a formulation of Mahalanobis metric learning as an optimization problem in which the objective is to minimize the number of violated similarity/dissimilarity constraints. We show that for any fixed ambient dimension, there exists a fully polynomial-time approximation scheme (FPTAS) with nearly-linear running time.
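The objective of this formulation can be sketched concretely: given a positive semidefinite matrix $A$ defining the Mahalanobis distance, count how many constraints the induced metric violates. The function below is an illustrative implementation of the objective only (not the FPTAS itself); the threshold names `u` and `l` for the similarity and dissimilarity margins are our own.

```python
import numpy as np

def mahalanobis_dist(A, x, y):
    """Squared Mahalanobis distance d_A(x, y) = (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

def violated_constraints(A, X, similar, dissimilar, u, l):
    """Count the similarity/dissimilarity constraints violated by metric A.

    A similar pair (i, j) is violated when d_A(x_i, x_j) > u;
    a dissimilar pair (i, j) is violated when d_A(x_i, x_j) < l.
    This is the quantity the optimization over A seeks to minimize.
    """
    count = 0
    for i, j in similar:
        if mahalanobis_dist(A, X[i], X[j]) > u:
            count += 1
    for i, j in dissimilar:
        if mahalanobis_dist(A, X[i], X[j]) < l:
            count += 1
    return count
```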
Next, we present efficient algorithms for poisoning data sets against $k$-center and $k$-means clustering. Formally, given a set of points, $X$, our algorithms compute a small set of poison points, $P$, such that the performance of the chosen clustering method degrades as much as possible on $X\cup P$ compared to the original data set, $X$. We obtain efficient poisoning approximation algorithms that achieve a $\frac{1}{6}$-approximation for $k$-center clustering and an $(\Omega(1/k^2),2)$-approximation for $k$-means clustering, compared to optimal poisoning. We corroborate our theoretical results with experimental evidence that our methods can be used to evaluate the robustness of clustering algorithms.
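The degradation being maximized can be illustrated with a small sketch. Here the $k$-center cost on $X$ and on $X \cup P$ is estimated with Gonzalez's greedy 2-approximation as a stand-in clustering method; the poison point and data below are our own toy example, not the thesis's poisoning algorithm.

```python
import numpy as np

def kcenter_cost(X, k):
    """k-center cost via Gonzalez's greedy 2-approximation:
    repeatedly pick the point farthest from the chosen centers,
    then report the max distance of any point to its nearest center."""
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(dists))])
    dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
    return float(dists.max())

# Two well-separated clusters on a line (toy data, assumed for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
P = np.array([[100.0, 0.0]])   # a single far-away poison point

base = kcenter_cost(X, k=2)                     # cost on X alone
poisoned = kcenter_cost(np.vstack([X, P]), k=2) # cost on X ∪ P
```

A single distant poison point forces one of the two centers to "waste" itself on the outlier, so the cost on $X \cup P$ is much larger than on $X$.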
Finally, we present an algorithm for poisoning data sets against $k$-NN classification by flipping a small fraction of the labels through the application of multi-scale random partitions.
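The effect of label-flipping poisoning can be sketched with a minimal example. This sketch only demonstrates how flipping a single training label degrades plain $k$-NN accuracy; it does not implement the multi-scale random partition machinery, and all data and names below are illustrative assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Plain k-NN: majority label among the k nearest training points."""
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    labels, counts = np.unique(y_train[idx], return_counts=True)
    return labels[int(np.argmax(counts))]

# Two classes on a line (toy data); test points sit near each class.
X_train = np.array([[0.0], [1.0], [9.0], [10.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.4], [9.6]])
y_test = np.array([0, 1])

clean_acc = np.mean([knn_predict(X_train, y_train, x) == y
                     for x, y in zip(X_test, y_test)])

y_poisoned = y_train.copy()
y_poisoned[0] = 1      # flip one label inside the class-0 region
pois_acc = np.mean([knn_predict(X_train, y_poisoned, x) == y
                    for x, y in zip(X_test, y_test)])
```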