posted on 2020-08-01, 00:00 authored by Hani Aldirawi
Sparse count data, such as microbiome data, transcriptomics or RNA-seq data, and insurance claim data, are typically overdispersed and contain an excess of zeros, which makes them challenging to model.
In this dissertation, we aim to answer two questions: (1) How do we identify the most appropriate probabilistic model for a given sparse data set? (2) With available covariates, how do we build the most appropriate regression model for predicting a sparse response?
In response to the first question, we propose a statistical procedure for identifying the most appropriate discrete probabilistic model among zero-inflated and hurdle models, based on the bootstrapped p-values of a sequence of discrete Kolmogorov-Smirnov (KS) tests. We develop a general procedure for estimating the parameters of a large class of zero-inflated and hurdle models. We also develop a bootstrapped likelihood ratio testing procedure, based on the Neyman-Pearson theorem, for selecting the best model when there is more than one candidate probabilistic model.
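The bootstrapped KS idea can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the dissertation's procedure or the iZID interface: the candidate model is taken to be a zero-inflated Poisson, parameters are fitted with a simple EM iteration, and every name below (`rzip`, `fit_zip`, `ks_stat`, `bootstrap_ks_pvalue`) is hypothetical. The point it shows is that the KS statistic is recomputed on each simulated sample after refitting, so the p-value reflects the fact that the parameters were estimated.

```python
import math
import random

def rpois(lam, rng):
    """One Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def rzip(n, phi, lam, rng):
    """n draws from a zero-inflated Poisson; phi is the structural-zero probability."""
    return [0 if rng.random() < phi else rpois(lam, rng) for _ in range(n)]

def fit_zip(y, iters=200):
    """EM estimates (phi, lam) for a zero-inflated Poisson (assumed fitting method)."""
    n, n0, total = len(y), sum(1 for v in y if v == 0), sum(y)
    if total == 0:                      # degenerate sample: all zeros
        return 1.0, 0.0
    phi, lam = 0.5 * n0 / n, total / n
    for _ in range(iters):
        # E-step: probability that an observed zero is a structural zero
        w = phi / (phi + (1 - phi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean
        phi = n0 * w / n
        lam = total / (n - n0 * w)
    return phi, lam

def ks_stat(y, phi, lam):
    """Discrete KS statistic: max over k of |empirical CDF - fitted ZIP CDF|."""
    n, counts = len(y), {}
    for v in y:
        counts[v] = counts.get(v, 0) + 1
    d, ecdf, pois_cdf, term = 0.0, 0.0, 0.0, math.exp(-lam)
    for k in range(max(y) + 1):
        ecdf += counts.get(k, 0) / n
        pois_cdf += term                # running Poisson CDF up to k
        term *= lam / (k + 1)           # Poisson pmf at k + 1
        d = max(d, abs(ecdf - (phi + (1 - phi) * pois_cdf)))
    return d

def bootstrap_ks_pvalue(y, n_boot=200, seed=1):
    """Parametric bootstrap: simulate from the fit, refit, recompute the statistic."""
    rng = random.Random(seed)
    phi, lam = fit_zip(y)
    d_obs = ks_stat(y, phi, lam)
    exceed = 0
    for _ in range(n_boot):
        yb = rzip(len(y), phi, lam, rng)
        exceed += ks_stat(yb, *fit_zip(yb)) >= d_obs
    return d_obs, (exceed + 1) / (n_boot + 1)

# Usage on simulated data (true phi = 0.3, lam = 2.5 are illustrative values)
y = rzip(300, 0.3, 2.5, random.Random(0))
phi_hat, lam_hat = fit_zip(y)
d_obs, p_value = bootstrap_ks_pvalue(y)
print(phi_hat, lam_hat, d_obs, p_value)
```

Running the same routine for each candidate distribution in turn would give the sequence of bootstrapped p-values that the paragraph above refers to.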
We also develop a new R package, "iZID", as a software tool to help users answer the first question. For zero-inflated count data, the package uses a bootstrapped Monte Carlo procedure to control the bias in estimating the p-value of a KS test, and bootstrapped likelihood ratio tests for zero-inflated model selection. It also provides functions to simulate zero-inflated and hurdle count data and to calculate maximum likelihood estimates of unknown parameters. Compared with other R packages available so far, our package covers more types of zero-inflated and hurdle distributions and provides adjusted p-value estimates that account for the influence of unknown model parameters.
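For readers unfamiliar with the two families, their standard textbook forms can be written as follows. The notation is an assumption made here for illustration: f denotes a baseline count distribution (for example Poisson or negative binomial) and φ a probability in [0, 1]; the abstract itself does not fix these symbols.

```latex
% Zero-inflated model: zeros arise from a point mass at 0 or from the baseline f
P(Y = 0) = \phi + (1 - \phi)\, f(0), \qquad
P(Y = k) = (1 - \phi)\, f(k), \quad k = 1, 2, \dots

% Hurdle (zero-altered) model: all zeros come from the point mass,
% positive counts come from f truncated at zero
P(Y = 0) = \phi, \qquad
P(Y = k) = (1 - \phi)\, \frac{f(k)}{1 - f(0)}, \quad k = 1, 2, \dots
```

In the zero-inflated form the zero probability can only exceed that of the baseline, whereas the hurdle form can also represent fewer zeros than the baseline would produce.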
To answer the second question, we build a fairly general class of regression models, called Zero-Inflated Regression Models (ZIRM), which not only covers currently available zero-inflated regression models, such as ZIP, ZINB with fixed r, and ZIBB with constant prior parameters, but also includes new regression models: ZINB with flexible r, ZIBB with flexible prior parameters, and ZIBNB. We also build the corresponding Hurdle Regression Models for zero-altered responses. With the enriched set of candidate models, we perform model selection based on the AIC and BIC criteria. Our application to insurance claim data shows that ZINB with flexible r is more appropriate than the other candidates.
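As a small illustration of AIC/BIC-based selection, the sketch below compares a plain Poisson model with a zero-inflated Poisson on simulated data. It is deliberately simplified and is not the ZIRM framework: there are no covariates (intercept-only models), the ZIP fit uses an assumed EM iteration, the data are simulated rather than insurance claims, and all function names are hypothetical.

```python
import math
import random

def draw_poisson(lam, rng):
    """One Poisson draw by Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def poisson_logpmf(k, lam):
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def loglik_poisson(y, lam):
    return sum(poisson_logpmf(k, lam) for k in y)

def loglik_zip(y, phi, lam):
    """Log-likelihood of a zero-inflated Poisson with structural-zero prob. phi."""
    ll = 0.0
    for k in y:
        if k == 0:
            ll += math.log(phi + (1 - phi) * math.exp(-lam))
        else:
            ll += math.log(1 - phi) + poisson_logpmf(k, lam)
    return ll

def fit_zip_em(y, iters=500):
    """EM estimates (phi, lam); assumes the sample has at least one positive count."""
    n, n0, total = len(y), sum(1 for v in y if v == 0), sum(y)
    phi, lam = 0.5 * n0 / n, total / n
    for _ in range(iters):
        w = phi / (phi + (1 - phi) * math.exp(-lam))   # E-step
        phi, lam = n0 * w / n, total / (n - n0 * w)    # M-step
    return phi, lam

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik

# Simulated zero-inflated counts (phi = 0.4, lam = 3 are illustrative values)
rng = random.Random(42)
y = [0 if rng.random() < 0.4 else draw_poisson(3.0, rng) for _ in range(500)]
n = len(y)

lam_pois = sum(y) / n                  # closed-form Poisson MLE
phi_zip, lam_zip = fit_zip_em(y)

fits = {
    "Poisson": (loglik_poisson(y, lam_pois), 1),   # one free parameter
    "ZIP": (loglik_zip(y, phi_zip, lam_zip), 2),   # two free parameters
}
scores = {name: {"AIC": aic(ll, k), "BIC": bic(ll, k, n)}
          for name, (ll, k) in fits.items()}
best_by_aic = min(scores, key=lambda m: scores[m]["AIC"])
best_by_bic = min(scores, key=lambda m: scores[m]["BIC"])
print(scores, best_by_aic, best_by_bic)
```

Lower values are better for both criteria; BIC penalizes each extra parameter by log(n) rather than 2, so it favors smaller models more strongly as the sample grows.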
For general zero-inflated regression models, we derive and simplify the general form of the Fisher information matrix and then perform significance tests for variable selection. We compare the confidence intervals based on the Fisher information matrix with those built by bootstrapping. The results are consistent with each other. Compared with the bootstrap approach, variable selection based on the Fisher information matrix is clearly more efficient. Nevertheless, we suggest using bootstrap confidence intervals when the sample size is moderate or small.
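The comparison between Fisher-information-based and bootstrap intervals can be illustrated on a much simpler case than the one treated in the dissertation. The sketch below assumes a plain Poisson sample with a single mean parameter, where the Fisher information for n observations is n/λ, and contrasts the resulting Wald interval with a percentile bootstrap interval. The data, seed, and variable names are illustrative assumptions; the general zero-inflated regression case requires the full information matrix derived in the thesis.

```python
import math
import random

def draw_poisson(lam, rng):
    """One Poisson draw by Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

rng = random.Random(7)
y = [draw_poisson(3.0, rng) for _ in range(200)]   # simulated sample, true mean 3
n = len(y)
lam_hat = sum(y) / n                               # Poisson MLE of the mean

# Wald interval: Fisher information is n / lam, so SE(lam_hat) = sqrt(lam_hat / n)
se = math.sqrt(lam_hat / n)
wald_ci = (lam_hat - 1.96 * se, lam_hat + 1.96 * se)

# Percentile bootstrap interval: resample the data with replacement and re-estimate
n_boot = 2000
boot = sorted(sum(rng.choices(y, k=n)) / n for _ in range(n_boot))
boot_ci = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])

wald_width = wald_ci[1] - wald_ci[0]
boot_width = boot_ci[1] - boot_ci[0]
print(lam_hat, wald_ci, boot_ci)
```

The Wald interval needs one closed-form evaluation, while the bootstrap interval re-estimates the parameter thousands of times, which is the efficiency trade-off described above.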
Advisor
Yang, Jie
Chair
Yang, Jie
Department
Mathematics, Statistics, and Computer Science
Degree Grantor
University of Illinois at Chicago
Degree Level
Doctoral
Degree name
PhD, Doctor of Philosophy
Committee Member
Yang, Min
Wang, Jing
Zhong, Ping-Shou
Chen, Hua Yun