Crowdsourcing, the practice of sourcing data from a large crowd of human workers, has become an effective, efficient, and scalable data collection paradigm
in domains such as text and image tagging, spam detection, and product rating and ranking, where tasks are easier for human beings than for computers.
However, crowdsourced data are often noisy, incomplete, and erroneous, owing to the limited competence of workers, the malicious injection of false information, and other factors,
which raises trustworthiness issues in the collected data.
In this thesis, I explore these issues in two crowdsourcing settings: 1) crowdsourcing with a panel and 2) crowdsourcing in the wild.
Under the first setting, I study the problem of worker competence estimation,
so as to sift out less accurate workers and emphasize the input from more reliable ones.
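As a point of reference (not the model developed in this thesis), worker competence is often estimated by alternating between label aggregation and per-worker accuracy estimation, in the spirit of Dawid-Skene style models. The sketch below illustrates that loop; the data layout and function name are hypothetical.

```python
# Minimal sketch (hypothetical, not the thesis's model): iterative weighted voting.
# annotations: dict mapping item_id -> list of (worker_id, label) pairs.
from collections import defaultdict

def estimate_competence(annotations, n_iters=10):
    # Start by trusting every worker equally.
    competence = defaultdict(lambda: 0.8)
    labels = {}
    for _ in range(n_iters):
        # Step 1: infer each item's label by competence-weighted voting.
        for item, votes in annotations.items():
            scores = defaultdict(float)
            for worker, label in votes:
                scores[label] += competence[worker]
            labels[item] = max(scores, key=scores.get)
        # Step 2: re-estimate each worker's competence as agreement
        # with the currently inferred labels (with light smoothing).
        hits, totals = defaultdict(float), defaultdict(float)
        for item, votes in annotations.items():
            for worker, label in votes:
                totals[worker] += 1.0
                hits[worker] += 1.0 if label == labels[item] else 0.0
        for worker in totals:
            competence[worker] = (hits[worker] + 1.0) / (totals[worker] + 2.0)
    return labels, dict(competence)
```

The returned competence scores can then be used to down-weight or discard unreliable workers in later aggregation.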
I then handle label correlation in multi-label crowdsourcing
and propose models that jointly infer the label correlations and more trustworthy
multi-label annotations.
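For intuition only, one simple way to exploit label correlation is to estimate pairwise correlations from the aggregated votes themselves and let strongly correlated labels reinforce each other; the joint inference models in this thesis are more principled than this, and every name and parameter below is hypothetical.

```python
# Hypothetical sketch: estimate pairwise label correlations from noisy
# multi-label votes and use them to nudge weak per-label vote scores.
import numpy as np

def correlation_adjusted_labels(vote_matrix, alpha=0.3, threshold=0.5):
    # vote_matrix: items x labels, each entry the fraction of workers
    # who assigned that label to that item (values in [0, 1]).
    V = np.asarray(vote_matrix, dtype=float)
    # Pearson correlation between label columns as a crude correlation estimate.
    C = np.corrcoef(V, rowvar=False)
    np.fill_diagonal(C, 0.0)
    # Blend each label's raw votes with correlation-weighted evidence
    # from the other labels, then threshold.
    positive = np.clip(C, 0.0, None)
    propagated = V @ positive
    norm = positive.sum(axis=0) + 1e-9
    adjusted = (1 - alpha) * V + alpha * (propagated / norm)
    return adjusted >= threshold
```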
I further propose a large-margin-based framework to find the parameter space best suited for distilling
trustworthy information from crowdsourced data.
The situation is quite different when crowdsourcing information
from a crowd in the wild, as in rating and ranking systems where a large number
of unknown workers contribute their opinions.
The challenges mainly come from malicious workers in the crowd, and the goal
is to detect and remove such workers.
I propose a time-series pattern mining based approach to collectively detect singleton spamming attacks,
which attackers widely adopt because of the significant financial incentives
and the well-concealed trails such attacks leave.
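To convey the flavor of such detection (this is not the algorithm proposed in the thesis), singleton attacks often surface as short bursts of one-time reviewers on a single product; the hypothetical sketch below flags time windows whose singleton-review count is an outlier relative to the product's own history.

```python
# Hypothetical sketch: flag time windows with an abnormal burst of singleton reviews.
# reviews: list of (timestamp, reviewer_id) pairs for one product, timestamps in days.
from collections import Counter
import statistics

def suspicious_windows(reviews, window=7, z_thresh=3.0):
    # A singleton reviewer appears exactly once in this product's history.
    counts = Counter(r for _, r in reviews)
    singleton_times = [t for t, r in reviews if counts[r] == 1]
    if not singleton_times:
        return []
    start, end = min(singleton_times), max(singleton_times)
    # Count singleton reviews per fixed-size window.
    buckets = Counter(int((t - start) // window) for t in singleton_times)
    n_buckets = int((end - start) // window) + 1
    series = [buckets.get(i, 0) for i in range(n_buckets)]
    mu = statistics.mean(series)
    sigma = statistics.pstdev(series) or 1.0
    # Windows whose singleton count is a large positive outlier are suspicious.
    return [i for i, c in enumerate(series) if (c - mu) / sigma > z_thresh]
```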
I then study various biases in crowdsourced ratings caused by sample selection and subjectivity,
and propose a transfer-learning-based iterative bias correction method that is efficient in its use of human supervision.
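As a rough, hypothetical illustration of supervision-efficient bias correction (the transfer-learning machinery of the thesis is not reproduced here), one can iteratively spend a small expert-labeling budget on the ratings whose correction is most uncertain and refit a simple calibration after each round:

```python
# Hypothetical sketch: iteratively calibrate crowd ratings against a small
# budget of expert ratings, querying experts only where correction is uncertain.
import numpy as np

def iterative_bias_correction(crowd, ask_expert, budget=20, rounds=4):
    # crowd: array of raw crowdsourced ratings, one per item.
    # ask_expert(i): returns a trusted rating for item i (costly, used sparingly).
    crowd = np.asarray(crowd, dtype=float)
    labeled, expert_vals = [], []
    a, b = 1.0, 0.0                      # current linear correction: a * r + b
    per_round = max(1, budget // rounds)
    for _ in range(rounds):
        corrected = a * crowd + b
        # Spend budget on the items farthest from the corrected consensus,
        # i.e. where the current correction is least trustworthy.
        residual = np.abs(corrected - np.median(corrected))
        candidates = [i for i in np.argsort(-residual) if i not in labeled]
        for i in candidates[:per_round]:
            labeled.append(i)
            expert_vals.append(ask_expert(i))
        # Refit the linear correction on all expert-labeled items so far.
        x = crowd[labeled]
        y = np.asarray(expert_vals, dtype=float)
        a, b = np.polyfit(x, y, 1) if len(labeled) > 1 else (1.0, y.mean() - x.mean())
    return a * crowd + b
```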
Lastly, I propose a dimension-reduction-based framework to detect irrelevant text comments crowdsourced on social media.
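A minimal sketch of the dimension-reduction idea follows, under the assumption that irrelevant comments lie far from the dominant topic subspace of the collection; the vectorizer, component count, and threshold are illustrative choices, not those of the thesis.

```python
# Hypothetical sketch: project comments onto a low-dimensional topic subspace
# and flag those that the subspace reconstructs poorly as likely irrelevant.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def flag_irrelevant(comments, n_components=20, quantile=0.95):
    # TF-IDF features; assumes the vocabulary is larger than n_components.
    X = TfidfVectorizer(stop_words="english").fit_transform(comments)
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    Z = svd.fit_transform(X)                      # low-dimensional representation
    X_hat = svd.inverse_transform(Z)              # reconstruction in TF-IDF space
    # Reconstruction error: comments off the main topics reconstruct poorly.
    err = np.linalg.norm(X.toarray() - X_hat, axis=1)
    threshold = np.quantile(err, quantile)
    return [c for c, e in zip(comments, err) if e > threshold]
```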