Distilling Trustworthy Knowledge from Crowdsourced Data
Thesis posted on 18.10.2016, 00:00 by Sihong Xie
Crowdsourcing, the practice of sourcing data from a large crowd of human workers, has become an effective, efficient, and scalable data collection paradigm for tasks that are easier for humans than for computers, such as text and image tagging, spam detection, and product rating and ranking. However, crowdsourced data are usually noisy, incomplete, or erroneous, owing to the incompetence of some workers, the malicious injection of false information, and similar causes, which raises concerns about the trustworthiness of the data. In this thesis, I explore these issues in two crowdsourcing settings: 1) crowdsourcing with a panel and 2) crowdsourcing in the wild. Under the first setting, I study the problem of worker competence estimation, in order to sift out the less accurate workers and emphasize the input from more reliable ones. I then handle label correlation in multi-labeled crowdsourcing and propose models that jointly infer the label correlations and the more trustworthy multi-labeled annotations. I also propose a large-margin framework to find the best parameter space for distilling trustworthy information from crowdsourced data. The situation is quite different when crowdsourcing information from a crowd in the wild, as in rating and ranking systems where a large number of unknown workers contribute their opinions. The challenges come mainly from malicious workers in the crowd, and the goal is to detect and remove such workers. I propose an approach based on time series pattern mining to collectively detect singleton spamming attacks, which are widely adopted by attackers because of the significant financial incentives and the well-covered trails of the attacks. I then study various biases in crowdsourced ratings that arise from sample selection bias and subjectivity, and propose an iterative bias correction method based on transfer learning that is efficient in terms of human supervision. Lastly, I propose a framework based on dimension reduction to detect irrelevant text comments crowdsourced on social media.