Advanced Information Retrieval Within Blogosphere and Micro-Blogosphere

Jia, Lifeng

JIA_LIFENG.pdf (2.11 MB)

thesis

Posted on 2014-02-24, 00:00; authored by Lifeng Jia

Social media have gained worldwide popularity, and their volume is growing rapidly. Blogs and microblogs are two typical types of social media. By February 2011, over 156 million public blogs existed, and the volume of the blogosphere was predicted to double about every 5.5 months. Twitter, a microblogging service, had a daily volume of over 340 million tweets by 2012. With such an overwhelming amount of information in the blogosphere and micro-blogosphere, advanced information retrieval techniques are needed. In this thesis, I tackle two advanced information retrieval problems: faceted blog distillation over the blogosphere and real-time tweet ad-hoc retrieval over the micro-blogosphere.

In faceted blog distillation, users aim to retrieve blogs that are not only relevant to queries but also exhibit certain qualities. Six aspects of quality (called facets) are considered, in three opposing pairs: opinionated vs. factual, personal vs. official, and in-depth vs. shallow. Opinionated blogs provide posts that contain opinions relevant to the queries, while factual blogs consist of posts that describe the topics of the queries without opinionated content. The posts in personal blogs depict topics related to the personal experiences of bloggers, while those in official blogs serve the commercial purposes of bloggers. In-depth blogs provide deep analysis of the topics of interest, while the posts in shallow blogs simply mention the topics without analyzing the implications of the information provided. Faceted blog distillation therefore consists of three sub-problems: opinionated and factual blog distillation, personal and official blog distillation, and in-depth and shallow blog distillation. For opinionated and factual blog distillation, I propose a classification-based method to identify the opinions relevant to queries in terms of syntax and semantics. For personal and official blog distillation, I propose two categories of methods: classification based and topic modeling based. All proposed methods effectively differentiate personal blog posts from official blog posts. For in-depth and shallow blog distillation, I propose a measurement to compute the query-oriented depth of blog posts. I also discuss the relationships among facets. The proposed techniques are evaluated using 220 TREC 2006-2010 queries over two TREC collections: Blogs06 and Blogs08. The proposed methods significantly outperform the best known results for faceted blog distillation.

In real-time tweet ad-hoc retrieval, users wish to see tweets that are not only relevant to queries but also as recent as possible. I propose a two-phase approach to address this problem. Tweets can be categorized into two types. One type consists of short messages that do not contain any URL of a web page. The other type has at least one URL of a web page in addition to a short message. These two types of tweets have different structures. In the first phase, I propose a learning-to-rank method that ranks tweets using a divide-and-conquer strategy to address the structural difference between the two types. In the second phase, I propose three novel categorizations of queries in terms of their temporal sensitivities; then I calculate the time-related relevance scores of tweets according to the classified types of queries; finally, I combine the time scores with the IR scores from the first phase to produce a ranking of tweets. Experimental results obtained using the TREC 2011 and TREC 2012 queries over the TREC Tweets2011 collection show that the proposed divide-and-conquer method of ranking tweets yields better retrieval effectiveness than ranking the two types of tweets together, and that the proposed incorporation of temporal information into the retrieval process yields further improvements. The method also compares favorably with state-of-the-art methods in retrieval effectiveness.
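The two-phase idea for tweet retrieval can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not give the learning-to-rank features, the query categories, or the formula for combining scores. Everything below is an assumption made for illustration, including the term-overlap scorers, the `page_text` field, the exponential half-life decay, and the mixing weight `lam`. The sketch only shows the overall shape: score the two tweet types with separate scorers, compute a recency score, then blend the two.

```python
import re
from dataclasses import dataclass

URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Tweet:
    text: str
    age_hours: float      # hours between posting time and query time
    page_text: str = ""   # hypothetical: text of the linked web page, if any


def has_url(tweet: Tweet) -> bool:
    """Split tweets into the two structural types: with or without a URL."""
    return URL_PATTERN.search(tweet.text) is not None


def term_overlap(query: str, text: str) -> float:
    """Toy relevance signal: fraction of query terms that occur in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def ir_score(query: str, tweet: Tweet) -> float:
    """Phase 1 stand-in: a separate scorer for each tweet type."""
    if has_url(tweet):
        # Assumed scorer for URL tweets: average message and linked-page overlap.
        return 0.5 * term_overlap(query, tweet.text) + 0.5 * term_overlap(query, tweet.page_text)
    # Assumed scorer for plain tweets: message overlap only.
    return term_overlap(query, tweet.text)


def time_score(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Phase 2 stand-in: recency score that halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def rank(query: str, tweets: list[Tweet], lam: float) -> list[Tweet]:
    """Blend IR and time scores; `lam` models the query's temporal sensitivity."""
    def combined(tweet: Tweet) -> float:
        return (1.0 - lam) * ir_score(query, tweet) + lam * time_score(tweet.age_hours)
    return sorted(tweets, key=combined, reverse=True)


tweets = [
    Tweet("earthquake in chile", age_hours=48.0),
    Tweet("earthquake in chile http://example.com/a", age_hours=1.0,
          page_text="chile earthquake report"),
    Tweet("lunch was great", age_hours=0.5),
]
ranked = rank("chile earthquake", tweets, lam=0.3)
```

In this sketch, a larger `lam` would correspond to a more time-sensitive query, so recent tweets gain more weight; `lam = 0` reduces the ranking to the first-phase IR scores alone.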

History

Advisor

Yu, Clement T.

Department

Computer Science

Degree Grantor

University of Illinois at Chicago

Degree Level

Doctoral

Committee Member

Liu, Bing; Sistla, Prasad; Wang, Jing; Yu, Philip S.

Submitted date

2013-12

Language

en

Issue date

2014-02-24

Keywords

Faceted Blog Distillation; Real-time Tweet Ad-hoc Retrieval; Information Retrieval

Licence

In Copyright
