Posted on 2025-05-01. Authored by Usama Muneeb.
Given a very good predictive model that uses a restricted feature set, what is the best way to incorporate it into a large full-featured model? This problem arises in two main scenarios: (a) we have ample data with restricted features, or (b) the restricted model is easier to learn from the existing data. As an example of (a), when training a logistic regression model, most of the data may be missing many features for privacy reasons; similarly, when training an MDP policy with expensive full-sensing data, we may already have a model built from plentiful cheap partial-sensing data. As an example of (b), smaller language models such as N-grams are commonly used alongside large LMs, and these are assumed to be reliably learnable from the same data. We review prior works that have used restricted models in the training of full-featured models, via implicit or explicit regularization, and expose their caveats. To address these caveats, we propose Induced Model Matching (IMM), a methodology that aligns the context-restricted, or induced, version of the large model with the restricted model. We show that correctly incorporating the restriction is crucial: it yields consistency in the limit (theoretically) and better finite-sample performance (experimentally) than past approaches, namely (1) noising, which addresses the problem implicitly, and (2) reverse knowledge distillation from weak teachers, which addresses it explicitly. Neither approach exploits the fact that the restriction is the source of the weakness, and both can be problematic in terms of consistency. We demonstrate the merits of IMM using logistic regression as a proof of concept. We then apply it to language modeling, the application that originally inspired it, demonstrating it on both LSTM and transformer full models with bigrams as restricted models. We conclude with a simple RL example showing that POMDP policies can help in learning better MDP policies. The IMM principle is thus broadly applicable in common scenarios where restricted data is cheaper to collect or restricted models are easier to learn.
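To make the alignment concrete, below is a minimal sketch in the language-modeling setting, assuming a PyTorch full model and a pretrained bigram as the restricted model; the function name imm_loss, the tensor shapes, and the weight lam are hypothetical illustrations, not the thesis code. The idea it captures: averaging the full model's per-position predictions over data contexts that share the same last token approximates the full model's induced bigram, so a per-position divergence against the bigram's prediction serves as a sampled surrogate for matching the induced model to the restricted one.

import torch
import torch.nn.functional as F

def imm_loss(full_logits, bigram_probs, targets, lam=0.1):
    # Hypothetical sketch of an IMM-style training objective.
    # full_logits:  (batch, seq, vocab) logits from the full model
    # bigram_probs: (batch, seq, vocab) next-token probabilities from a
    #               pretrained bigram (restricted) model at the same positions
    # targets:      (batch, seq) observed next tokens
    # lam:          assumed weight on the matching term
    vocab = full_logits.size(-1)
    # Standard next-token cross-entropy on the observed data.
    nll = F.cross_entropy(full_logits.reshape(-1, vocab), targets.reshape(-1))
    # Per-position KL from the full model's prediction to the bigram's.
    # Over the data, full contexts ending in the same token are samples of
    # that token's context distribution, so this term is a sampled surrogate
    # for matching the full model's induced bigram to the restricted model.
    log_p_full = F.log_softmax(full_logits, dim=-1)
    match = F.kl_div(log_p_full.reshape(-1, vocab),
                     bigram_probs.reshape(-1, vocab),
                     reduction="batchmean")
    return nll + lam * match

In training, bigram_probs would typically be precomputed from corpus counts and looked up by the previous token at each position; lam trades off fit to the data against agreement with the restricted model.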
History
Advisor
Mesrob Ohannessian
Department
Electrical and Computer Engineering
Degree Grantor
University of Illinois Chicago
Degree Level
Doctoral
Degree Name
PhD, Doctor of Philosophy
Committee Members
Brian Ziebart
Natalie Parde
Shuo Han
Ahmet Enis Cetin