Underpinning the success of deep learning are effective structural prior modeling schemes that allow a broad range of domain-specific knowledge about data to be naturally encoded in a deep learning architecture. For example, in the computer vision community, convolutional neural networks implicitly encode transformation invariances (e.g., rotation and translation) by learning shareable weights across the spatial domain of images. For sequential data, such as natural language sentences and speech utterances, recurrent neural networks are another class of architectures that perceive sequential order and capture the dependence among inputs. Besides advanced network architectures, one of the most prevalent approaches to incorporating structural priors is regularization, which usually results in a complex non-convex optimization problem and creates tension between the performance of end tasks and the faithfulness of regularization.
We argue in this thesis that optimization methods provide an expressive set of primitive operations that allow us to integrate structural priors into the modeling pipeline without interfering with the learning of end tasks. We first propose inserting a proximal mapping as a hidden layer in a deep neural network, which directly and explicitly produces well-regularized hidden layer outputs. The resulting technique is shown to be closely connected to kernel warping and dropout, and we develop novel algorithms for robust temporal learning and multiview learning. Next, we extend our framework to learn well-regularized functions that map given inputs to structured outputs. As an instantiation of this approach, we address an unsupervised domain adaptation problem in which a minimax game renders the training process unstable. We propose a bi-level optimization based approach to decouple the minimax optimization, so that the model enjoys a much more principled and efficient training procedure. In addition, our method warps probability discrepancy measures toward the end tasks by leveraging the pseudo-labels produced by the optimal predictor.
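To make the first idea concrete, below is a minimal sketch of a proximal mapping used as a hidden layer, not the thesis's actual implementation. It assumes an L1 regularizer, whose proximal operator has the closed-form soft-thresholding solution; the class name ProximalL1Layer and the parameter lam are illustrative choices, not names from the thesis.

```python
# Sketch: a proximal mapping as a hidden layer (assumed L1 regularizer).
# The layer outputs prox_{lam*||.||_1}(Wx + b), i.e., the exact minimizer of
#   argmin_h 0.5*||h - z||^2 + lam*||h||_1,   z = Wx + b,
# which is the soft-thresholding operator and yields explicitly
# sparsified (well-regularized) hidden outputs.
import torch
import torch.nn as nn


class ProximalL1Layer(nn.Module):
    def __init__(self, in_features, out_features, lam=0.1):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.lam = lam  # regularization strength (hypothetical default)

    def forward(self, x):
        z = self.linear(x)
        # Soft-thresholding: sign(z) * max(|z| - lam, 0)
        return torch.sign(z) * torch.clamp(z.abs() - self.lam, min=0.0)


if __name__ == "__main__":
    layer = ProximalL1Layer(16, 8, lam=0.05)
    h = layer(torch.randn(4, 16))
    print(h)  # sparse hidden activations
```

Because soft-thresholding is differentiable almost everywhere, such a layer can be trained end-to-end with standard backpropagation; other regularizers with tractable proximal operators could be substituted in the same way.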
We validate our proposed methods through extensive experiments on image classification, speech recognition, cross-lingual word embedding, and domain adaptation. Our methods demonstrate a number of benefits over baseline methods, achieving state-of-the-art performance in various supervised and unsupervised learning tasks.
Advisor
Zhang, Xinhua
Chair
Zhang, Xinhua
Department
Computer Science
Degree Grantor
University of Illinois at Chicago
Degree Level
Doctoral
Degree name
PhD, Doctor of Philosophy
Committee Member
Yu, Philip S
Ziebart, Brian
Hu, Mengqi
White, Martha