Topic modeling algorithms promise to uncover the underlying semantics of large collections of documents, making them an effective tool for discovering knowledge online. In this thesis, we apply topic modeling algorithms to solve two major tasks.
The first task is to identify diversionary comments under blog posts. Diversionary comments are defined as comments that divert the discussion away from the topic of the original post, possibly to distract readers and draw their attention to a new topic. We categorize diversionary comments into five types based on our observations, and propose an effective framework to identify and flag them. Our approach combines coreference resolution, extraction from Wikipedia, and topic modeling algorithms to capture the underlying topics in each comment and in the post. We solve the problem in two different ways: (i) rank all the comments in descending order of how diversionary they are; (ii) treat it as a classification problem, distinguishing diversionary comments from non-diversionary ones.
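The ranking formulation (i) can be illustrated with a minimal sketch: once a topic model has assigned a topic distribution to the post and to each comment, comments can be ranked by how far their distributions diverge from the post's. The distance measure (Jensen-Shannon divergence) and the toy distributions below are illustrative assumptions, not the thesis's exact scoring function.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions
    (illustrative choice of distance; symmetric and bounded)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_diversionary(post_topics, comment_topics):
    """Return comment indices in descending order of divergence
    from the post's topic distribution."""
    scores = [(js_divergence(post_topics, t), i)
              for i, t in enumerate(comment_topics)]
    return [i for _, i in sorted(scores, reverse=True)]

# Toy example: 3-topic distributions for a post and two comments.
post = [0.7, 0.2, 0.1]
comments = [
    [0.65, 0.25, 0.10],  # close to the post: likely on-topic
    [0.05, 0.15, 0.80],  # far from the post: likely diversionary
]
ranking = rank_diversionary(post, comments)  # most diversionary first
```

In this toy setting the second comment, whose topic mass lies on a topic the post barely mentions, is ranked first.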
The second task is to design a sense-topic model that induces the senses of ambiguous words in a corpus. Since sense and topic are related but distinct linguistic phenomena, we treat them as two separate latent variables in our model: topics are inferred from the entire document, while senses are inferred from the local context surrounding the ambiguous word. To relate the sense and topic variables, we take inspiration from dependency networks and draw a bidirectional edge between them. We also present unsupervised ways of enriching the original dataset, including using neural word embeddings and external Web-scale corpora to enrich the context of each data instance or to add more instances.
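The embedding-based context enrichment can be sketched as follows: each word in an instance's local context is expanded with its nearest neighbors in embedding space, giving the sense model more evidence to work with. The toy two-dimensional embeddings and the neighbor count k are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def enrich_context(context, embeddings, k=2):
    """Expand a local context with the k nearest-neighbor words
    (by cosine similarity) of each context word."""
    enriched = list(context)
    for word in context:
        if word not in embeddings:
            continue
        neighbors = sorted(
            (w for w in embeddings if w != word and w not in enriched),
            key=lambda w: cosine(embeddings[word], embeddings[w]),
            reverse=True,
        )
        enriched.extend(neighbors[:k])
    return enriched

# Toy 2-d embeddings (hypothetical values for illustration).
toy = {
    "bank":  [0.9, 0.1],
    "money": [0.8, 0.2],
    "loan":  [0.7, 0.3],
    "river": [0.1, 0.9],
}
enriched = enrich_context(["money"], toy, k=1)  # adds the closest neighbor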