Advanced Information Retrieval Within Blogosphere and Micro-Blogosphere
Thesis posted on 2014-02-24, 00:00, authored by Lifeng Jia
Social media have gained worldwide popularity, and their volume is growing rapidly. Blogs and microblogs are two typical types of social media. By February 2011, over 156 million public blogs were in existence, and the volume of the blogosphere is predicted to double about every 5.5 months. Twitter, a microblogging service, had reached a daily volume of over 340 million tweets by 2012. With such an overwhelming amount of information in the blogosphere and micro-blogosphere, advanced information retrieval techniques are needed. In this thesis, I tackle two advanced information retrieval problems: faceted blog distillation over the blogosphere and real-time ad-hoc tweet retrieval over the micro-blogosphere.
For faceted blog distillation, users aim to retrieve blogs that are not only relevant to their queries but also exhibit a specified quality. Six aspects of quality (called facets), forming three opposing pairs, are considered: opinionated vs. factual, personal vs. official, and in-depth vs. shallow. Opinionated blogs provide posts that contain opinions relevant to the queries, while factual blogs consist of posts that describe the query topics without opinionated content. Posts in personal blogs describe topics related to the bloggers' personal experiences, while those in official blogs serve the bloggers' commercial purposes. In-depth blogs provide deep analysis of the topics of interest, while posts in shallow blogs simply mention the topics without analyzing the implications of the information provided. Faceted blog distillation consists of three sub-problems: opinionated and factual blog distillation, personal and official blog distillation, and in-depth and shallow blog distillation. For opinionated and factual blog distillation, I propose a classification-based method that identifies the opinions relevant to queries in terms of syntax and semantics. For personal and official blog distillation, I propose two categories of methods: classification-based and topic-modeling-based. All proposed methods effectively differentiate personal blog posts from official blog posts. For in-depth and shallow blog distillation, I propose a measure of the query-oriented depth of blog posts. I also discuss the relationships among facets. The proposed techniques are evaluated using 220 TREC 2006-2010 queries over two TREC collections: Blogs06 and Blogs08. The proposed methods significantly outperform the best known results for faceted blog distillation.
For real-time ad-hoc tweet retrieval, users wish to see tweets that are not only relevant to their queries but also the most recent. I propose a two-phase approach to address this problem. Tweets can be categorized into two types. One type consists of a short message that contains no URL of a web page. The other type contains at least one URL of a web page in addition to a short message. These two types of tweets have different structures. In the first phase, I propose a learning-to-rank method that ranks tweets using a divide-and-conquer strategy to address the structural difference between the two types. In the second phase, I propose three novel categorizations of queries in terms of their temporal sensitivities; I then calculate the time-related relevance scores of tweets according to the classified types of queries; finally, I combine the time scores with the IR scores from the first phase to produce a ranking of tweets. Experimental results obtained using the TREC 2011 and TREC 2012 queries over the TREC Tweets2011 collection show that the proposed divide-and-conquer method of ranking tweets yields better retrieval effectiveness than ranking them simultaneously, and that incorporating temporal information into the retrieval process yields further improvements. The method also compares favorably with state-of-the-art methods in retrieval effectiveness.
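The two-phase approach to tweet retrieval can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions, not the implementation from the thesis. The abstract does not say how the time scores and IR scores are combined, so the linear interpolation and its `weight` parameter are assumptions. The names `Tweet`, `has_url`, `two_phase_rank`, `score_plain`, `score_with_url`, and `time_score` are hypothetical. The two scorer callables stand in for the separately trained learning-to-rank models, and `time_score` stands in for the query-type-dependent temporal score.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Simple URL detector; an assumption, not the thesis's tweet parser.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Tweet:
    text: str
    age_hours: float  # hypothetical: hours between the tweet and the query time


def has_url(tweet: Tweet) -> bool:
    """Return True if the tweet text contains at least one web URL."""
    return URL_PATTERN.search(tweet.text) is not None


def two_phase_rank(
    tweets: List[Tweet],
    score_plain: Callable[[Tweet], float],     # stand-in ranker for tweets without URLs
    score_with_url: Callable[[Tweet], float],  # stand-in ranker for tweets with URLs
    time_score: Callable[[Tweet], float],      # stand-in temporal score
    weight: float = 0.7,                       # assumed interpolation weight
) -> List[Tweet]:
    """Rank tweets by an assumed linear mix of IR score and time score."""
    scored = []
    for tweet in tweets:
        # Phase 1 (divide and conquer): pick the scorer by tweet structure.
        ir = score_with_url(tweet) if has_url(tweet) else score_plain(tweet)
        # Phase 2: combine with the time score (linear mix is an assumption).
        final = weight * ir + (1.0 - weight) * time_score(tweet)
        scored.append((final, tweet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in scored]
```

For example, with constant IR scorers and a time score that decays with tweet age, a more recent tweet would be ranked above an older one of equal IR score, which is the intended effect of the second phase.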