Posted on 2021-05-01, 00:00; authored by Abhijeet Mohanty
Unlike mass media news articles, research papers have a well-defined structure and use much more formal language. Many words have specific meanings that do not change with the context in which they are used (e.g., repository always designates a kind of storage in a source code control system).
We theorize that, with limited manual effort, these research papers can be turned into a semantic graph from which new relations between terms can be obtained. Moreover, with vocabulary expansion, these terms can be linked to other concepts (e.g., repository -> storage -> memory), thereby enhancing the power of inferential reasoning about new relations. We also explore the possibility of finding contradictions among established relations, which raises new research questions that call for deeper exploration of the relations already obtained.
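The chain repository -> storage -> memory can be illustrated with a minimal sketch. It assumes, purely for illustration, that relations are directed and may be treated as transitive; the function name `infer_relations` and the edge list are hypothetical and are not taken from the thesis.

```python
from collections import deque


def infer_relations(edges):
    """Return (source, target) pairs that are reachable only indirectly.

    Assumption: relations are directed and transitive, so a path
    a -> b -> c suggests a candidate relation a -> c.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)

    inferred = set()
    for start, direct in graph.items():
        seen = set()
        queue = deque(direct)
        # Breadth-first walk over everything reachable from `start`.
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            queue.extend(graph.get(node, ()))
        # Keep only targets that were not already stated directly.
        for node in seen:
            if node != start and node not in direct:
                inferred.add((start, node))
    return inferred


# Hypothetical ground-truth relations entered by a user.
edges = [("repository", "storage"), ("storage", "memory")]
print(infer_relations(edges))  # {('repository', 'memory')}
```

In this sketch the inferred pair is only a candidate relation; whether it holds would still need to be judged by a researcher.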
Therefore, in this thesis, we address the problem of automatically creating inferences from the corpus of data published in research papers in the area of empirical software engineering.
To this end, we build an MVC-based application called "Generator of Research Output Units in Software Engineering" (GROUSE), which prompts the user to define relations between terms drawn from a corpus of 603 research papers published across various editions of the Mining Software Repositories conference. With limited manual effort, we first create a ground-truth conceptual graph. Our application then applies techniques such as Expected Entropy Loss (EEL), Latent Semantic Analysis (LSA), Relational Topic Modelling (RTM) and term expansion to this graph to infer new relations for semantically similar terms, for terms with similar underlying associations, and for papers that are related through the terms they contain.
With GROUSE, the empirical software engineering research community can collaborate on automatically generating new research questions that reveal deeper insights into software engineering processes and solutions.