Multimodal learning addresses the challenges deep networks face when processing complex datasets that contain multiple modalities, such as effectively integrating data with different structures and learning to fuse modalities like text and images. Traditional deep learning models are primarily inductive in nature and often struggle to capture the full semantic richness and implicit dependencies within such datasets. In this thesis, we present four recent papers that address these challenges by integrating and processing information from different sources or modalities of data. This integration enriches the understanding of individual data points and enhances the model's generalization ability in real-world applications.
We first introduce a novel self-supervised ontology matching (OM) method that fuses linguistic information with the structural information inherent in ontologies. We propose capturing multiple structural contexts, encompassing both local and global interactions between concepts in the ontologies. Our experiments on the Bio-ML datasets and tasks, which are publicly available from the Ontology Alignment Evaluation Initiative (OAEI), demonstrate that our method surpasses state-of-the-art OM systems in both alignment quality and inference time. This OM method supports building the well-structured knowledge bases and data integration that are crucial for facilitating further studies.
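As a minimal illustrative sketch only (not the thesis implementation; the function names and the averaging-based fusion are assumptions), the core idea of combining a concept's label embedding with mean-pooled local and global structural context embeddings, then matching concepts across two ontologies by cosine similarity, could look like this in Python:

# Illustrative sketch: fuse label and structural context embeddings, then
# propose candidate alignments between two ontologies by cosine similarity.
import numpy as np

def fuse_contexts(label_vec, local_vecs, global_vecs):
    """Average the label embedding with mean-pooled local and global contexts."""
    parts = [label_vec]
    if len(local_vecs):
        parts.append(np.mean(local_vecs, axis=0))
    if len(global_vecs):
        parts.append(np.mean(global_vecs, axis=0))
    fused = np.mean(parts, axis=0)
    return fused / (np.linalg.norm(fused) + 1e-12)   # unit-normalise for cosine similarity

def match(fused_a, fused_b, threshold=0.8):
    """Return candidate alignments (i, j, score) whose cosine similarity exceeds threshold."""
    A = np.stack(fused_a)                             # concepts from ontology A
    B = np.stack(fused_b)                             # concepts from ontology B
    sims = A @ B.T                                    # cosine similarity of unit vectors
    return [(i, j, float(sims[i, j]))
            for i in range(len(A)) for j in range(len(B))
            if sims[i, j] >= threshold]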
In the second paper, we address the challenge of specifying domain or prior modal knowledge in a backpropagation-friendly manner in large-scale and noisy settings, such as with large Vision-Language Models (VLMs). We propose a simplified alternative that combines features from pre-trained deep networks with freely available explicit semantic knowledge. To remove irrelevant explicit knowledge that does not correspond well with the images, we introduce an implicit, differentiable Out-of-Distribution (OOD) detection layer. This layer performs outlier detection by solving for fixed points of a differentiable function and backpropagating through only the last iterate of the fixed-point solver. In our experiments, we pre-trained on three public datasets: COCO, Visual Genome, and SBU Captions. For the downstream tasks, we used public datasets including Flickr30k and COCO for image-text retrieval, VQAv2 and OKVQA for visual question answering (and ablation studies), and NLVR2 for visual reasoning.
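A minimal sketch of the last-iterate idea, assuming a deep-equilibrium-style formulation: iterate a parameterized map to an approximate fixed point without tracking gradients, then take a single differentiable step so that only the last iterate participates in backpropagation. The module structure and the relevance-scoring head below are illustrative assumptions, not the paper's exact architecture.

# Sketch of an implicit layer that solves for a fixed point without gradients,
# then backpropagates through the last iterate only.
import torch
import torch.nn as nn

class FixedPointOODLayer(nn.Module):
    def __init__(self, dim, n_iters=30):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())  # contractive-style map
        self.score = nn.Linear(dim, 1)                              # relevance score per knowledge item
        self.n_iters = n_iters

    def forward(self, x):
        z = torch.zeros_like(x)
        with torch.no_grad():                                       # cheap forward fixed-point solve
            for _ in range(self.n_iters - 1):
                z = self.f(torch.cat([z, x], dim=-1))
        z = self.f(torch.cat([z.detach(), x], dim=-1))              # gradient flows through this last iterate only
        return torch.sigmoid(self.score(z))                         # low score => treat knowledge item as OOD / irrelevant

# usage: weights = FixedPointOODLayer(dim=256)(knowledge_feats)     # knowledge_feats: [batch, n_items, 256]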
Furthermore, we concentrate on optimizing neural network training and efficient tuning across multiple modalities. We focus on the landscape design of the logistic function and derive a novel sequence of strictly convex functions that are at least as strictly convex as the logistic loss. Our empirical analysis applies the proposed rooted logistic objective to multiple deep models across various classification benchmarks, conducted on four public datasets from the UCI Machine Learning Repository: Wine, Ionosphere, Madelon, and Specheart. Our results show that training with the rooted loss function converges faster and yields performance improvements. We further demonstrate applications of the rooted loss in post-training quantization and in generative-modeling-based downstream tasks, such as fine-tuning the StyleGAN model with the rooted loss. In particular, we trained on the public datasets CIFAR-10/100 and fine-tuned on two public datasets, Tiny-ImageNet and Food-101. For our image generation experiments with StyleGAN, we used two public datasets, FFHQ and Stanford Dogs. For the quantization of language generation tasks, we used three public datasets: WikiText2, Penn Treebank, and C4.
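For reference only, the exact rooted family and its parameters are specified in the paper; the sketch below shows one illustrative rooted variant that replaces the logarithm in the logistic loss log(1 + e^{-z}) with a scaled k-th root, k * ((1 + e^{-z})^{1/k} - 1). This particular parameterization, the function name, and the choice of k are assumptions; the variant is strictly convex in the margin z, upper-bounds the logistic loss for finite k, and recovers it as k grows.

# Illustrative rooted-style surrogate of the logistic loss (not necessarily the paper's exact form).
import torch

def rooted_logistic_loss(logits, targets, k=4.0):
    """targets in {-1, +1}; logits are raw scores. Returns the mean rooted loss."""
    z = targets * logits                        # classification margin
    base = 1.0 + torch.exp(-z)                  # same base term as the logistic loss
    return (k * (base.pow(1.0 / k) - 1.0)).mean()

# sanity check: for large k the rooted value approaches the logistic loss
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float() * 2 - 1
print(rooted_logistic_loss(logits, targets, k=1e4))
print(torch.log1p(torch.exp(-targets * logits)).mean())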
Lastly, we bridge vision foundation models and vision-language models for open-vocabulary segmentation. We propose a diffusion-based layer that combines features from different samples for segmentation. Our layer can incorporate information from multiple modalities for segmentation within a single pipeline, with no recurrence or recursion. Our pipeline uses diffused vision foundation models and CLIP to inform features across samples for novel-concept segmentation, including in training-free settings. We then incorporate the language modality within our framework, using CLIP embeddings for cut guidance to enhance open-vocabulary semantic segmentation. Our empirical results show significant improvements on various public benchmark datasets, including Pascal VOC, Pascal Context, MS-COCO, ADE20K, and Cityscapes.
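As a simplified, training-free illustration (not the full diffusion-based pipeline): given per-patch visual features from a frozen backbone and CLIP text embeddings of candidate class names, each patch can be assigned its most similar class by cosine similarity. The tensor names, shapes, and the hard argmax assignment below are assumptions for the sketch.

# Sketch: label patches with the closest open-vocabulary class in a shared embedding space.
import torch
import torch.nn.functional as F

def open_vocab_segment(patch_feats, text_feats, h, w):
    """patch_feats: [h*w, d]; text_feats: [num_classes, d]. Returns an (h, w) map of class indices."""
    patches = F.normalize(patch_feats, dim=-1)
    texts = F.normalize(text_feats, dim=-1)
    sims = patches @ texts.T                    # [h*w, num_classes] cosine similarities
    labels = sims.argmax(dim=-1)                # hard assignment per patch
    return labels.view(h, w)

# usage with dummy tensors
seg = open_vocab_segment(torch.randn(16 * 16, 512), torch.randn(5, 512), 16, 16)
print(seg.shape)  # torch.Size([16, 16])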
Future research is summarized along three key directions: efficient tuning of multimodal models, advancing the reasoning abilities of multimodal models, and applying VLMs to scientific data from different domains.
Advisor
Sathya Ravi
Department
Computer Science
Degree Grantor
University of Illinois Chicago
Degree Level
Doctoral
Degree Name
PhD, Doctor of Philosophy
Committee Member
Sourav Medya
Natalie Parde
Brian Ziebart
Darvin Yi