Posted on 2016-11-05, 00:00; authored by Venkat Raghavan Ganesh Sekar
The web contains a huge amount of semi-structured data in the form of tables and spreadsheets that is pertinent to statistical data analysis and visualization. Manual processing of this tabular data is tedious because of its heterogeneity in structure, concepts, and metadata. Further, much of the information in these tables lacks explicit metadata, which makes it difficult to understand the table semantics; that understanding is critical for processing the data automatically and for supporting data integration. In this thesis, we (a) study semi-structured (tabular) data on the web in depth; (b) discuss the complexities of processing it; (c) propose automatic methods to abstract its semantics by annotating various features inside the tables; and (d) introduce algorithms to construct a semantic graph by resolving different levels of heterogeneity. We evaluate our approach on a set of highly complex tables retrieved from different domains, and we discuss the impact of our work in practical scenarios and in the field of the Semantic Web.