Posted on 2013-10-24, 00:00. Authored by Xiaoxiao Shi.
With the rapid growth of big data mining, multiple related data sources containing different types
of features may be available for a given task. For instance, users’ profiles can be used to build recommendation
systems; in addition, a model can use users’ historical behaviors and social networks to
infer users’ interests in related products. We argue that it is desirable to use all available
heterogeneous data sources collectively in order to build effective learning models. We call this framework
heterogeneous learning.
Heterogeneous learning poses two main challenges:
(1) Learning from data with different statistical properties. For example, data from different
sources may violate the i.i.d. assumption, may have different feature spaces, may have
different prediction labels (different posteriors), or may exhibit any combination of these.
(2) Learning from data with different structures. For example, some of the data sources contain traditional
vector-based features (e.g., user profiles), while others are graph relational data (e.g., social
networks), or the data sources are chemical graphs with different structures.
In this thesis, we explore the above challenges from the views of supervised learning, unsupervised
learning, and feature projection, respectively, and apply the resulting methods to real-world problems. These applications include drug efficiency prediction, document classification,
image classification, movie rating prediction, chemical graph classification, collective
classification, and several datasets from the UCI database. The results show that heterogeneous learning improves learning accuracy significantly in some applications. For example, in the task of drug efficiency prediction, heterogeneous learning with a projection approach can reduce the error rate by over 50%.