Analysis of presence-only data via semi-supervised learning approaches
journal contribution posted on 2013-11-22, 00:00 authored by Junhui Wang, Yixin Fang
Presence-only data arise in classification problems and consist of a sample of observations from the presence class together with a large number of background observations whose presence/absence status is unknown. Since absence data are generally unavailable, conventional semi-supervised learning approaches are no longer appropriate, as they tend to degenerate and assign all observations to the presence class. In this article, we propose a generalized class balance constraint, which can be incorporated into semi-supervised learning approaches to prevent them from degenerating. Furthermore, to circumvent the difficulty of model tuning with presence-only data, a selection criterion based on classification stability is developed, which measures the robustness of any given classification algorithm against sampling randomness. The effectiveness of the proposed approach is demonstrated through a variety of simulated examples, along with an application to gene function prediction.