Integrating Image Features to Support the Biocuration Workflow
thesis
posted on 2023-12-01, 00:00authored byJuan Trelles Trabucco
Images in scientific publications communicate essential information, such as experiments performed and results obtained, that researchers can use as a proxy for the publication’s relevance to their research interests. Domains like biocuration, where biomedical experts analyze the literature to extract and organize information for future retrieval and analysis, can benefit from computationally leveraging these images in their workflows. Prior work in biomedical document classification for biocuration has shown that combining features from images (e.g., acquisition modality) with text features can improve precision and recall. However, the considerations and benefits obtained from computationally integrating such image-based features in biocuration workflows are less clear.
Based on a multi-year collaboration with text-mining researchers and biocurators, this dissertation identifies and tackles several challenges to computationally using image-data in the biocuration process. These challenges include the scarcity of labeled datasets for training image classifiers, the lack of characterization for the needed modality classes in biocuration, the lack of available labeling systems to label extensive collections of images, and the lack of support for image-based features in academic search engines. This dissertation addresses these issues by proposing two taxonomies for image modalities found in biomedical publications; introducing training strategies using shallower neural networks and hierarchical organization of classifiers; describing two labeling systems, one for domain experts and one for model builders who require support for data understanding; and concludes by merging several building blocks into document search systems that successfully leverage image-based data.
History
Advisor
Georgeta Elisabeta Marai
Department
Computer Science
Degree Grantor
University of Illinois Chicago
Degree Level
Doctoral
Degree name
PhD, Doctor of Philosophy
Committee Member
Andrew Johnson
Wei Tang
Cecilia Arighi
Steven M. Drucker