University of Illinois Chicago

Model Selection and Regression Analysis for Sparse Discrete Data

thesis
posted on 2020-08-01, 00:00 authored by Hani Aldirawi
Sparse count data, such as microbiome data, transcriptomics or RNA-seq data, and insurance claim data, are typically overdispersed with an excess of zeros, which makes them challenging to model. In this dissertation, we aim to answer two questions: (1) How do we identify the most appropriate probabilistic model for a given sparse data set? (2) With available covariates, how do we build the most appropriate regression model for predicting a sparse response? In response to the first question, we propose a statistical procedure for identifying the most appropriate discrete probabilistic models for zero-inflated or hurdle models, based on the bootstrapped p-values of a sequence of discrete Kolmogorov-Smirnov (KS) tests. We develop a general procedure for estimating the parameters of a large class of zero-inflated and hurdle models. We also develop a bootstrapped likelihood ratio testing procedure, based on the Neyman-Pearson theorem, for selecting the best model when there is more than one candidate probabilistic model. We develop a new R package, "iZID", as a software tool to help users answer the first question. For zero-inflated count data, we use a bootstrapped Monte Carlo procedure to control the bias in estimating the p-value of a KS test, as well as bootstrapped likelihood ratio tests for zero-inflated model selection. The package also provides functions to simulate zero-inflated and hurdle count data and to calculate maximum likelihood estimates of unknown parameters. Compared with other R packages available so far, our package covers more types of zero-inflated and hurdle distributions and provides adjusted p-value estimates that account for the influence of unknown model parameters.
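To illustrate the idea of a bootstrapped p-value for a discrete KS test, the following is a minimal Python sketch for a single candidate model, the zero-inflated Poisson (ZIP). It is not the iZID implementation: the function names (zip_pmf, zip_sample, zip_fit, ks_stat, bootstrap_ks_pvalue), the EM fitting routine, and the default of 200 bootstrap replicates are all assumptions made for illustration. The point it shows is that the parameters are re-estimated on every simulated sample, so the reference distribution of the KS statistic reflects the fact that the model parameters are unknown.

```python
import math
import random

def zip_pmf(k, phi, lam):
    """P(X = k) for a ZIP model: extra zero mass phi, Poisson rate lam > 0."""
    base = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    return phi + (1 - phi) * base if k == 0 else (1 - phi) * base

def zip_sample(n, phi, lam, rng):
    """Draw n ZIP values; Poisson draws use CDF inversion."""
    out = []
    for _ in range(n):
        if rng.random() < phi:          # structural zero
            out.append(0)
            continue
        u, k, p = rng.random(), 0, math.exp(-lam)
        cdf = p
        while u > cdf and k < 1000:     # guard against round-off in the tail
            k += 1
            p *= lam / k
            cdf += p
        out.append(k)
    return out

def zip_fit(data, iters=200):
    """Maximum likelihood estimates (phi, lam) via a simple EM iteration."""
    n, total = len(data), sum(data)
    n0 = sum(1 for x in data if x == 0)
    if n0 == 0:                         # no zeros: plain Poisson fit
        return 0.0, total / n
    phi, lam = 0.5 * n0 / n, max(total / n, 1e-8)
    for _ in range(iters):
        # E-step: probability that an observed zero is a structural zero
        z = phi / (phi + (1 - phi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson rate
        phi = n0 * z / n
        lam = max(total / (n - n0 * z), 1e-8)
    return phi, lam

def ks_stat(data, phi, lam):
    """Discrete KS statistic: max |empirical CDF - model CDF| over 0..max(data)."""
    n, top = len(data), max(data)
    counts = [0] * (top + 1)
    for x in data:
        counts[x] += 1
    d = emp = model = 0.0
    for k in range(top + 1):
        emp += counts[k] / n
        model += zip_pmf(k, phi, lam)
        d = max(d, abs(emp - model))
    return d

def bootstrap_ks_pvalue(data, n_boot=200, seed=0):
    """Parametric-bootstrap p-value; parameters are refitted on each replicate."""
    rng = random.Random(seed)
    phi, lam = zip_fit(data)
    d_obs = ks_stat(data, phi, lam)
    exceed = 0
    for _ in range(n_boot):
        sim = zip_sample(len(data), phi, lam, rng)
        exceed += ks_stat(sim, *zip_fit(sim)) >= d_obs
    return (exceed + 1) / (n_boot + 1)

# Usage: simulate ZIP data, then test the ZIP model against it.
counts = zip_sample(300, 0.4, 3.0, random.Random(1))
phi_hat, lam_hat = zip_fit(counts)
p_value = bootstrap_ks_pvalue(counts)
```

A small p-value would suggest the ZIP model fits poorly; repeating the same test over a sequence of candidate distributions gives one p-value per candidate. The likelihood ratio step for choosing among several acceptable candidates is not shown here.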
To answer the second question, we build a fairly general class of regression models, called Zero-Inflated Regression Models (ZIRM), which not only covers currently available zero-inflated regression models, such as ZIP, ZINB with fixed r, and ZIBB with constant prior parameters, but also includes new regression models, such as ZINB with flexible r, ZIBB with flexible prior parameters, and ZIBNB. We also build the corresponding Hurdle Regression Models for zero-altered responses. With the enriched set of model candidates, we perform model selection based on the AIC and BIC criteria. Our application to insurance claim data shows that ZINB with flexible r is more appropriate than any of the other candidates. For general zero-inflated regression models, we derive and simplify the general form of the Fisher information matrix and then perform significance tests for variable selection. We compare the confidence intervals based on the Fisher information matrix with those built by bootstrapping, and the results are consistent with each other. Compared with the bootstrapping solutions, variable selection based on the Fisher information matrix is apparently more efficient. Nevertheless, we suggest using bootstrapped confidence intervals when the sample size is moderate or small.
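For readers unfamiliar with the terminology, the standard textbook forms of a zero-inflated model, a hurdle model, and the AIC and BIC criteria can be written as below. The notation is an assumption made for illustration and may differ from the thesis: f(y; θ) is a baseline count distribution (for example Poisson or negative binomial), φ is the zero-related probability, k is the number of free parameters, n is the sample size, and L̂ is the maximized likelihood.

```latex
% Zero-inflated model: extra point mass \phi at zero, mixed with the baseline pmf
P(Y = y) =
\begin{cases}
\phi + (1-\phi)\, f(0;\theta), & y = 0,\\
(1-\phi)\, f(y;\theta), & y = 1, 2, \dots
\end{cases}

% Hurdle (zero-altered) model: zeros modeled separately, positives from the zero-truncated baseline
P(Y = y) =
\begin{cases}
\phi, & y = 0,\\
(1-\phi)\, \dfrac{f(y;\theta)}{1 - f(0;\theta)}, & y = 1, 2, \dots
\end{cases}

% Information criteria (smaller is better)
\mathrm{AIC} = 2k - 2\log \hat{L}, \qquad
\mathrm{BIC} = k \log n - 2\log \hat{L}
```

In a regression setting, φ and the components of θ are typically linked to covariates through link functions; the specific links used in the thesis are not stated in this abstract.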

History

Advisor

Yang, Jie

Chair

Yang, Jie

Department

Mathematics, Statistics, and Computer Science

Degree Grantor

University of Illinois at Chicago

Degree Level

  • Doctoral

Degree name

PhD, Doctor of Philosophy

Committee Member

Yang, Min; Wang, Jing; Zhong, Ping-Shou; Chen, Hua Yun

Submitted date

August 2020

Thesis type

application/pdf

Language

  • en
