Unsupervised Feature Selection for Heterogeneous Data
In the era of big data, many data mining applications must contend with high-dimensional data. Feature selection has therefore become an important technique, since it can alleviate the curse of dimensionality, speed up the learning process, and improve interpretability. This dissertation focuses on unsupervised feature selection, because class labels are usually expensive to obtain. Evaluating the quality of features is typically more challenging in the unsupervised setting than in the supervised one, owing to the lack of guidance from class labels. We design several new criteria that have desirable properties and can effectively identify discriminative features without using class labels. Moreover, as data collection capabilities improve, data samples increasingly come in heterogeneous forms, such as networked data, multi-modal/multi-view data, and data equipped with complex side information. Such heterogeneous information (e.g., network structure and additional views) can be highly useful when class labels are not available. We therefore design algorithms that exploit this information to select high-quality features from heterogeneous data. We believe these models provide new perspectives on unsupervised feature selection and address the challenges posed by the heterogeneity of big data.