Variable Selection in Presence of Strong Collinearity with Application to Environmental Mixtures
Thesis posted on 01.12.2020, 00:00 by Jiyeong Jang
Variable selection has become an essential element of high-dimensional statistical modeling, yielding parsimonious models while maintaining high prediction accuracy. High dimensionality often induces collinearity problems. For instance, studies of environmental mixtures include a large number of pollutants that are strongly inter-correlated. Regularized variable selection methods such as LASSO are popular; however, these methods often perform poorly in both selection and prediction in the presence of strong collinearity. To address these challenges, a novel method, COrrelation LeaRNing for variable Selection (COLRNS), is developed, based on iterative correlation learning for cluster detection and variable selection. COLRNS is further extended to the COLRNS Generalized Linear Model (COLRNS-GLM) for use in generalized linear regression settings. The performance of the methods is evaluated through an extensive set of simulations and real-world applications to environmental mixtures data. The results show that, under strong collinearity in high-dimensional data, the methods effectively identify a set of influential predictors, improve prediction accuracy, and reduce error in parameter estimation in most simulation scenarios and data applications.