University of Illinois Chicago

Model Identification and Variable Selection for High-Dimensional Sparse Data

thesis
posted on 2023-08-01, 00:00 authored by Seyedeh Niloufar Dousti Mousavi
This dissertation is based on the results of four collaborative research projects and one R package: "An R Package AZIAD for Analyzing Zero-Inflated and Zero-Altered Data" (2), "Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data" (4), "Model Selection and Regression Analysis for Zero-altered or Zero-inflated Data", "A Trigamma-free Approach for Computing Information Matrices Related to Trigamma Function", and "AZIAD: Analyzing Zero-Inflated and Zero-Altered Data" (5).

In Chapter 2 of my dissertation, I focused on modeling sparse data using zero-inflated and zero-altered models with both discrete and continuous baseline distributions. I derived the formulas for the Fisher information matrix and developed a comprehensive R package called AZIAD for analyzing zero-inflated and zero-altered data. AZIAD can compute maximum likelihood estimates for 27 different distributions and conduct Kolmogorov-Smirnov (KS) tests using two different algorithms, one recommended for smaller sample sizes and the other for larger sample sizes. Additionally, it calculates the Fisher information and confidence intervals for all parameters in the model, and it performs model selection using the likelihood ratio test. I performed numerous simulation studies on the size and power of the tests compared with other existing packages. To test the effectiveness of the package, I conducted analyses on two real datasets, the "DebTrivedi" data and an omics dataset. The results showed that AZIAD is a powerful tool for selecting appropriate models through KS tests and model selection.

One of my research interests is variable selection in high-dimensional data, which has numerous applications in computational statistics and computer science. In Chapter 3, my primary research goal in this area is to investigate the performance of variable selection compared to dimension reduction in high-dimensional data. I explored this subject with applications to (I) gene expression big data and (II) vaginal microbiome data. In this study, I proposed a new technique called the "significance test on group labels". The goal is to select the most informative covariates for predicting the class labels. The procedure is as follows: first, we perform model selection for each covariate via the KS test and compute its Akaike information criterion (AIC). Second, for each class label (assuming m classes), we perform model selection and compute the AIC value; a new aggregated AIC is then calculated by summing up the AIC values from the m classes. Third, we take the difference of the two AIC values, with and without class labels. A larger difference indicates that the covariate is more informative for predicting the class labels. To test the accuracy of the proposed technique, two case studies with real data were conducted, as briefly elaborated below.

As the first case study, I applied the proposed method to gene expression data. The RNA-seq gene expression dataset is a high-dimensional dataset consisting of numerous genes measured on hundreds of patients with different cancerous tumors. The data include more than twenty thousand genes, many of which contain a high proportion of zeros. By applying the proposed method, I was able to rank the sparse genes based on their AIC differences (a minimal sketch of this ranking step is shown below). A critical question is how many genes should be selected for predicting the class labels.
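The following sketch, written in R, illustrates the AIC-difference computation for a single covariate. For illustration only, it assumes a Poisson baseline model has already been selected for the covariate; in the dissertation the baseline model for each covariate is chosen by KS-test-based model selection among zero-inflated and zero-altered candidates, and the function names below are hypothetical rather than part of AZIAD.

# Hypothetical helper: AIC of a Poisson fit (assumed baseline, for illustration only).
aic_poisson <- function(y) {
  lambda_hat <- mean(y)                               # MLE of the Poisson rate
  loglik <- sum(dpois(y, lambda_hat, log = TRUE))     # maximized log-likelihood
  2 * 1 - 2 * loglik                                  # AIC with one free parameter
}

# AIC difference with and without class labels for one covariate.
aic_label_difference <- function(y, labels) {
  aic_all <- aic_poisson(y)                           # AIC ignoring the class labels
  aic_by_class <- sum(tapply(y, labels, aic_poisson)) # aggregated AIC over the m classes
  aic_all - aic_by_class                              # larger value => more informative covariate
}

# Toy usage: counts of one gene across two tumor types.
set.seed(1)
y <- c(rpois(40, 2), rpois(40, 8))
labels <- rep(c("typeA", "typeB"), each = 40)
aic_label_difference(y, labels)

Applying aic_label_difference to every covariate and sorting by the returned value gives the ranking used to select informative genes.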
To answer that question, I utilized a 1-nearest-neighbor classifier with various numbers of selected genes to predict the class labels (cancer type) and found that the 50 most significant genes, chosen based on the smallest training error, are a good option. In order to obtain a fair estimate of the prediction error, 5-fold cross-validation was conducted; the best prediction error rate, 0, was attained at 50 genes. As shown in the article, the proposed method outperformed other methods that combine clustering with dimension reduction.

As the second case study, the proposed method was applied to a longitudinal vaginal microbiome dataset. The dataset, available in (7), includes vaginal microbiome species of 32 non-pregnant women and 22 pregnant women who had a term delivery without complications. The purpose of the study was to characterize the changes in the composition of the vaginal microbiome (concentrating on the Lactobacillus microbiome) during 38 weeks of pregnancy between the two groups of women, pregnant and non-pregnant. As part of the data screening process, missing time points were imputed using nearest-neighbor imputation along with linear interpolation. The proposed method was then applied to detect significant changes in the Lactobacillus microbiome over time. It was found that the two groups tend to be significantly different after week 22. To further investigate the differences between the two groups before and after week 22 of pregnancy, we conducted a more detailed analysis of the estimated parameters over time. It was observed that Lactobacillus counts are significantly smaller in the pregnant group compared to the non-pregnant group at the end of the pregnancy. This longitudinal study can be extended to any other vaginal microbiome species in the dataset as well.

Chapter 4 of my dissertation focuses on regression analysis in zero-inflated and zero-altered models. These types of models are commonly used in statistics to handle data with an excess of zeros, which can occur in fields such as ecology, epidemiology, and economics. The chapter includes an overview of zero-inflated and zero-altered regression models and how they can be fitted to data using statistical software, and it contains the derivation of formulas for different discrete distributions such as ZIBNB and ZIBB. The methods are tested on two real datasets, the "DebTrivedi" data and the "Insurance Claim" data. The best model was found among 14 different regression models and 4 different link functions. After selecting the best model for each dataset, the corresponding Fisher information and confidence intervals for all parameters in the model were calculated, and variable selection was then performed to find the most influential covariates for the corresponding model. The chapter also covers the interpretation of results and how to make inferences about the relationship between the response variable and the predictors in these models. Overall, it provides a detailed exploration of a specialized area of regression analysis that can be extremely valuable for researchers working with count data. In addition, a trigamma-free approach was introduced. The motivation is to overcome the difficulty and inefficiency of calculating the expectation of the trigamma function by Monte Carlo simulation (a small sketch of that Monte Carlo computation is given below). The introduced method is very efficient and fast and leads to more accurate results.
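To make the motivation concrete, the following R sketch shows the conventional Monte Carlo approximation of E[trigamma(Y + r)] that the trigamma-free approach is designed to replace. The negative binomial baseline for Y is a hypothetical example, and the exact closed-form formulas derived in the dissertation are not reproduced here.

# Conventional Monte Carlo approximation of E[trigamma(Y + r)], shown only to
# illustrate the computation that the trigamma-free approach avoids.
# Y ~ Negative Binomial(size = r, prob = p) is a hypothetical example baseline.
mc_expected_trigamma <- function(r, p, n_sim = 1e5) {
  y <- rnbinom(n_sim, size = r, prob = p)   # Monte Carlo draws of Y
  mean(trigamma(y + r))                     # sample mean approximates the expectation
}

set.seed(1)
mc_expected_trigamma(r = 2.5, p = 0.3)

Because the simulation must be repeated for every entry of the Fisher information matrix and every parameter value of interest, avoiding it yields the speed and accuracy gains described above.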
Finally, in Chapter 5, I share my ideas and directions for future work in variable selection for high-dimensional data through categorical data analysis and a new R package that can perform regression analysis for different distributions.

History

Advisor

Yang, Jie

Chair

Yang, Jie

Department

Mathematics, Statistics, and Computer Science

Degree Grantor

University of Illinois at Chicago

Degree Level

  • Doctoral

Degree name

PhD, Doctor of Philosophy

Committee Member

Wang, Jing
Pajda-De La O, Jennifer
Han, Kyunghee
Zhong, Ping-Shou
Chen, Hua Yun

Submitted date

August 2023

Thesis type

application/pdf

Language

  • en
