Posted on 2013-11-22, 00:00. Authored by Junhui Wang, Yixin Fang
Presence-only data arise in classification problems where the sample consists of
observations known to belong to the presence class, together with a large number of
background observations whose presence/absence status is unknown. Because absence
data are generally unavailable, conventional semi-supervised learning approaches are
no longer appropriate: they tend to degenerate and assign all observations to the
presence class. In this article, we propose a generalized class balance constraint
that can be combined with semi-supervised learning approaches to prevent this
degeneration. Furthermore, to circumvent the difficulty of model tuning with
presence-only data, we develop a selection criterion based on classification
stability, which measures the robustness of a given classification algorithm against
sampling randomness. The effectiveness of the proposed approach is demonstrated
through a variety of simulated examples and an application to gene function
prediction.
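
The stability idea can be illustrated with a minimal sketch. It is not the authors' criterion, whose exact form the abstract does not give. It assumes instability is estimated as the average disagreement between two classifiers fitted on independent bootstrap resamples of the same presence and background data. The names `instability` and `make_threshold_fit`, the one-dimensional threshold classifier, its `weight` tuning parameter, and the toy data are all hypothetical and chosen only for illustration.

```python
import random


def instability(fit, presence, background, eval_points, n_pairs=20, seed=0):
    """Average disagreement rate between classifiers fitted on pairs of
    independent bootstrap resamples (an assumed estimator, for illustration)."""
    rng = random.Random(seed)

    def resample(xs):
        # Bootstrap: draw len(xs) points from xs with replacement.
        return [rng.choice(xs) for _ in xs]

    total = 0.0
    for _ in range(n_pairs):
        f1 = fit(resample(presence), resample(background))
        f2 = fit(resample(presence), resample(background))
        # Fraction of evaluation points on which the two fits disagree.
        total += sum(f1(x) != f2(x) for x in eval_points) / len(eval_points)
    return total / n_pairs


def make_threshold_fit(weight):
    """Hypothetical 1-D classifier: cut between the background mean and the
    presence mean, with `weight` acting as the tuning parameter."""
    def fit(presence, background):
        mp = sum(presence) / len(presence)
        mb = sum(background) / len(background)
        cut = weight * mp + (1 - weight) * mb
        if mp >= mb:
            return lambda x: int(x >= cut)
        return lambda x: int(x <= cut)
    return fit


# Toy data (made up): labeled presence points, plus unlabeled background
# drawn from a mixture of absence-like and presence-like points.
data_rng = random.Random(1)
presence = [data_rng.gauss(2.0, 1.0) for _ in range(30)]
background = ([data_rng.gauss(0.0, 1.0) for _ in range(100)]
              + [data_rng.gauss(2.0, 1.0) for _ in range(50)])

# Compare candidate tuning values by their estimated instability.
scores = {w: instability(make_threshold_fit(w), presence, background, background)
          for w in (0.25, 0.5, 0.75)}
most_stable = min(scores, key=scores.get)
```

In this sketch the tuning value with the smallest estimated instability would be preferred. Note that a classifier assigning every observation to one class is perfectly stable under this measure, so a stability score of this kind is only meaningful alongside a safeguard against degenerate solutions.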
Publisher Statement
NOTICE: This is the author’s version of a work that was accepted for publication in Computational Statistics and Data Analysis. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Computational Statistics and Data Analysis, Vol 59 (2012), DOI: 10.1016/j.csda.2012.10.007