
Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique


Files in this item

File: 1471-2105-11-S1-S6.pdf (528KB)
Description: (no description provided)
Format: PDF
Title: Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique
Author(s): Bhardwaj, Nitin; Gerstein, Mark; Lu, Hui
Subject(s): peripheral proteins; supervised learning
Abstract:
Background: In supervised learning, traditional approaches to building a classifier use two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required, which might be infeasible in certain cases, especially those dealing with biological data. Such is the case for membrane-binding peripheral domains, which play important roles in many biological processes, including cell signaling and membrane trafficking, by reversibly binding to membranes. For these domains, a well-defined positive set of domains known to bind membranes is available, along with a large unlabeled set of domains whose membrane-binding affinities have not been measured. This limitation can be addressed by a special class of semi-supervised machine learning called positive-unlabeled (PU) learning, which uses a positive set together with a large unlabeled set.
Methods: In this study, we implement the first application of PU-learning to a protein function prediction problem: identification of peripheral domains. PU-learning starts by iteratively identifying reliable negative (RN) examples from the unlabeled set until convergence, and then builds a classifier using the positive set and the final RN set. A data set of 232 positive cases and ~3750 unlabeled ones was used to construct and validate the protocol.
Results: Holdout evaluation of the protocol on a left-out positive set showed that the accuracy of prediction reached up to 95% in two independent implementations.
Conclusion: These results suggest that our protocol can be used for predicting membrane-binding properties of a wide variety of modular domains. Protocols like the one presented here are particularly useful when information is available from only one class.
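The iterative RN-extraction loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' protocol: the abstract does not name the classifier or the features, so the sketch assumes a simple nearest-centroid classifier on made-up 2-D feature vectors. All function names (`pu_learn`, `predict_positive`, and so on) and all data points are hypothetical and chosen only for illustration.

```python
# Toy positive-unlabeled (PU) learning sketch.
# ASSUMPTIONS (not from the article): nearest-centroid classifier,
# 2-D toy features, hypothetical function names and data.

def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_positive(x, pos_center, neg_center):
    """Nearest-centroid rule: positive if x is closer to the positive centroid."""
    return sq_dist(x, pos_center) < sq_dist(x, neg_center)

def pu_learn(positives, unlabeled, max_iter=100):
    """Shrink the unlabeled set to reliable negatives (RN) until it stops changing."""
    pos_center = centroid(positives)
    rn = list(unlabeled)  # start by treating every unlabeled example as negative
    for _ in range(max_iter):
        neg_center = centroid(rn)
        # Keep only the examples the current classifier still calls negative.
        new_rn = [x for x in rn if not predict_positive(x, pos_center, neg_center)]
        if not new_rn or len(new_rn) == len(rn):
            break  # converged (or nothing left to keep): stop with the current RN set
        rn = new_rn
    # Final classifier is defined by the positive centroid and the RN centroid.
    return pos_center, centroid(rn), rn

# Hypothetical toy data: positives cluster near (2, 2); the unlabeled set
# mixes two hidden positives with six true negatives near (-2, -2).
positives = [(2.0, 2.0), (2.5, 1.5), (1.5, 2.5), (2.2, 2.1)]
unlabeled = [(2.1, 1.9), (1.8, 2.2),
             (-2.0, -2.0), (-1.5, -2.5), (-2.5, -1.5),
             (-2.2, -1.8), (-1.8, -2.1), (-2.1, -2.3)]
held_out_positives = [(1.9, 2.0), (2.3, 2.2)]  # left out of training

pos_center, neg_center, reliable_negatives = pu_learn(positives, unlabeled)

# Holdout check: fraction of left-out positives recovered by the classifier.
recall = sum(
    predict_positive(x, pos_center, neg_center) for x in held_out_positives
) / len(held_out_positives)
print(len(reliable_negatives), recall)
```

On this toy data the two hidden positives are dropped from the RN set in the first pass and the loop stops on the second. The holdout figure here only demonstrates the mechanics of evaluating on left-out positives and says nothing about the accuracy reported in the article.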
Issue Date: 2010-01-18
Publisher: BioMed Central
Citation Info: Bhardwaj, N., Gerstein, M., & Lu, H. 2010. Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique. BMC Bioinformatics, 11 Suppl 1: S6. DOI: 10.1186/1471-2105-11-S1-S6
Type: Article
Description: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The original source for this publication is at BioMed Central; DOI: 10.1186/1471-2105-11-S1-S6
ISSN: 1471-2105
Sponsor: The work is supported by NIH grant P01AI060915 to H.L.
Date Available in INDIGO: 2011-05-05




