Posted on 2020-05-01, 00:00. Authored by Matteo Marziali
Searching for and retrieving information efficiently is a pressing need in both everyday and business tasks. With the introduction of high-performance storage systems and cloud technologies, paper-based reporting has become obsolete. Companies have therefore sought to convert such operations into digital processes, using digital repositories to store files and thereby gain availability and resilience. In this setting, software that provides direct and reliable access to stored data is referred to as a search engine.
This dissertation analyzes cognitive and keyword-based search algorithms, with the goal of developing a domain-specific search engine that combines the two approaches. The rationale for adopting the two techniques together lies in the need to overcome the lack of ‘query context’ and ‘intent understanding’, as well as the inefficiency of common procedures in handling identical words that carry different meanings. To meet these objectives, Text Retrieval was carried out in the legal domain, searching a corpus of authentic legal documents from the Italian Court of Cassation.
We organized our work into two core activities, Document Processing and Text Retrieval, both integrated into the search engine pipeline. In particular, Text Retrieval was performed on top of processing units built expressly to return proper answers to both literal and non-literal queries. During the Document Processing phase, significant effort was devoted to extracting text from actual judicial documents, which were initially available only as images of scanned documents. Text Retrieval, instead, concerned the realization of a search engine pipeline featuring diverse Deep Learning approaches. These techniques encode portions of text into a richer representation that captures both syntactic and semantic word features. Finally, the embedding approaches considered are compared by collecting answers to specific questionnaires given to a random sample of people, with the purpose of validating our approaches in a concrete use-case scenario.
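The idea of combining keyword-based and embedding-based retrieval can be illustrated with a minimal sketch. It is not the pipeline developed in the dissertation: the function names, the weighting parameter `alpha`, the English toy documents, and the hand-written two-dimensional vectors (standing in for embeddings produced by a Deep Learning model) are all assumptions made for illustration.

```python
import math


def keyword_score(query, doc):
    """Fraction of query terms that occur literally in the document."""
    q_terms = query.lower().split()
    d_terms = set(doc.lower().split())
    if not q_terms:
        return 0.0
    return sum(t in d_terms for t in q_terms) / len(q_terms)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank document indices by a weighted sum of the literal (keyword)
    score and the semantic (embedding) score. alpha=1.0 is keyword-only;
    alpha=0.0 is embedding-only. Ties keep the original document order."""
    scored = []
    for i, (doc, vec) in enumerate(zip(docs, doc_vecs)):
        score = alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


# Hypothetical toy corpus; the vectors are hand-picked, not model output.
docs = [
    "appeal rejected by the court",
    "contract breach damages awarded",
    "the tribunal dismissed the claim",
]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = "court appeal"
query_vec = [1.0, 0.0]

print(hybrid_rank(query, query_vec, docs, doc_vecs))             # hybrid
print(hybrid_rank(query, query_vec, docs, doc_vecs, alpha=1.0))  # keyword-only
```

In this toy setting the third document shares no words with the query, so a keyword-only ranking cannot separate it from the unrelated second document, while the hybrid score places it second because its vector lies close to the query vector.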