Posted on 2013-11-14, 00:00. Authored by Wei Sun, Junhui Wang, Yixin Fang.
K-means clustering is a widely used tool for cluster analysis
due to its conceptual simplicity and computational efficiency. However, its
performance can be distorted when clustering high-dimensional data where
the number of variables becomes relatively large and many of them may
contain no information about the clustering structure. This article proposes
a high-dimensional cluster analysis method via regularized k-means
clustering, which can simultaneously cluster similar observations and
eliminate redundant variables. The key idea is to formulate k-means
clustering as a regularization problem, with an adaptive group lasso
penalty term on the cluster centers. To optimally balance the trade-off
between clustering model fit and sparsity, a selection criterion based on
clustering stability is developed. The asymptotic estimation and selection
consistency of the regularized k-means clustering with diverging dimension
is established.
The effectiveness of the regularized k-means clustering is also demonstrated
through a variety of numerical experiments as well as applications to two
gene microarray examples. The regularized clustering framework can also
be extended to general model-based clustering.
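
The key idea above, k-means with a group lasso penalty on the cluster centers, can be illustrated with a minimal sketch. The abstract does not state the objective or the algorithm, so everything here is an assumption made for illustration: the objective written in the docstring, the function names `group_soft_threshold` and `regularized_kmeans`, the `init_idx` initialization, and the proximal gradient update for the centers. The variables are centered first, so a variable whose centers are all shrunk to zero no longer separates the clusters and is effectively eliminated.

```python
import numpy as np


def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrink v toward zero, or zero it out."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v


def regularized_kmeans(X, init_idx, lam, weights=None, n_iter=50, n_inner=100):
    """Alternate k-means assignments with group-lasso-penalized center updates.

    Assumed objective (an illustration, not quoted from the source):
        (1/n) * sum_i ||x_i - c_label(i)||^2
            + lam * sum_j weights[j] * ||centers[:, j]||_2
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # center each variable so zero means "uninformative"
    n, p = X.shape
    centers = X[list(init_idx)].copy()  # hypothetical init: chosen rows as centers
    K = centers.shape[0]
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Per-cluster sizes and means (empty clusters get a zero mean).
        counts = np.bincount(labels, minlength=K).astype(float)
        sums = np.zeros((K, p))
        np.add.at(sums, labels, X)
        means = sums / np.maximum(counts, 1.0)[:, None]
        # Center step: proximal gradient on the penalized objective,
        # with step size 1/L where L = 2 * max(counts) / n.
        step = n / (2.0 * counts.max())
        for _ in range(n_inner):
            grad = (2.0 / n) * counts[:, None] * (centers - means)
            Z = centers - step * grad
            for j in range(p):
                # Each variable j is one group: its K center coordinates.
                centers[:, j] = group_soft_threshold(Z[:, j], step * lam * w[j])
    return centers, labels
```

With `weights=None` the penalty is a plain group lasso. An adaptive version would pass variable-specific weights, for example the inverse column norms of the centers from an unpenalized run (`lam=0`). That is the usual adaptive lasso construction and is assumed here, not taken from the abstract. Choosing `lam` by the clustering stability criterion is not sketched.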