Posted on 2018-11-27, 00:00; authored by Massimo Piras
Due to the massive success of social media, online user-generated content has grown exponentially in recent years. Twitter, as a microblogging platform, allows users to share their opinions and activities through short posts called tweets. However, opinion spammers see social networks like Twitter as an opportunity to propagate their ideas, promoting or discrediting a target product or service without revealing their true intentions. In this study, we focused on detecting suspicious users who posted dubious claims about cancer treatment and prevention on Twitter. We framed the task as a supervised binary classification problem: predicting whether each user was suspicious or genuine. We collected a set of 60 thousand cancer-related tweets posted in October 2017 by more than 36 thousand users. Since manual labeling could be a very complicated process, we designed a set of features for each user, covering both the content of their posts and their behavior on Twitter, and combined them to compute a spam score. The basic idea was that suspicious users would show different feature distributions from genuine users, which would help us separate the two classes. We then ranked the users by spam score and used the ranking to assign the labels.
Finally, we ran a few classifiers on our labeled data, showing that suspicious users had different textual and behavioral patterns which could be used to distinguish them from genuine ones.
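The labeling step described above (per-user features, a combined spam score, a ranking, then labels) can be illustrated with a minimal sketch. The abstract does not say which features were used, how they were combined, or where the ranking was cut, so everything specific here is an assumption made for illustration: the feature names `url_ratio` and `duplicate_ratio`, the combination by summing z-scores, the `suspicious_fraction` cutoff, and the toy data are all hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of numbers; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def spam_scores(users, feature_names):
    """Sum the z-scored features of each user into one score.

    Assumption: a higher feature value means more spam-like behavior,
    and all features are weighted equally.
    """
    ids = list(users)
    scores = {uid: 0.0 for uid in ids}
    for name in feature_names:
        column = zscores([users[uid][name] for uid in ids])
        for uid, z in zip(ids, column):
            scores[uid] += z
    return scores


def label_by_rank(scores, suspicious_fraction):
    """Rank users by descending score and label the top fraction as suspicious."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = round(len(ranked) * suspicious_fraction)
    return {
        uid: "suspicious" if position < cutoff else "genuine"
        for position, uid in enumerate(ranked)
    }


# Hypothetical toy data: four users with two illustrative features.
users = {
    "u1": {"url_ratio": 0.9, "duplicate_ratio": 0.8},
    "u2": {"url_ratio": 0.1, "duplicate_ratio": 0.0},
    "u3": {"url_ratio": 0.2, "duplicate_ratio": 0.1},
    "u4": {"url_ratio": 0.3, "duplicate_ratio": 0.2},
}
scores = spam_scores(users, ["url_ratio", "duplicate_ratio"])
labels = label_by_rank(scores, suspicious_fraction=0.25)
```

In this toy example only `u1`, which is highest on both features, ends up labeled suspicious. Labels produced this way could then serve as training targets for a standard classifier, which is the role the ranking-based labels play in the study.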