Analysis of Survey Data with Non-Ignorable Missing Covariates
Thesis posted on 08.02.2018 by Fima Lanra Fredrik Gerarld Langi
Missing data are common in survey sampling and create a spectrum of inferential problems. In this thesis, a method to analyze survey data with potentially non-ignorable missing covariates is proposed. The approach is developed specifically to address limitations in the current routines of standard statistical packages that arise when, simultaneously, the model of interest has a mixture of categorical and continuous missing covariates, the analysis needs to incorporate the sampling design under different assumptions about its functional form, and computation time must remain manageable in practice. The proposed method proceeds as a full likelihood procedure if the sampling probability function is known for all observations, but it becomes a quasi-likelihood approach when the survey weight is the only available information about sample selection. Three classes of survey data are considered during the development: those in which none (Case 1), all (Case 2), or some (Case 3) of the covariates are observable outside the sample. Two situations are further defined for each class, according to whether the functional form of sample selection is known (Situation 1) or unknown (Situation 2). Given its construction, the proposed method, termed the augmentation assisted EM algorithm or simply the augmentation method, retains the desirable properties of maximum likelihood estimates while being flexible enough to handle both continuous and categorical missing covariates, and it can adapt the use of survey weights to improve inference. The simulation studies indicate that the proposed method performs reliably well across all classes of survey data. In terms of unbiasedness, it is competitive with, and may occasionally outperform, multiple imputation by chained equations (MICE), a well-known multiple imputation technique. The efficiency of its estimates is also comparable to that of MICE.
In the real data application using the dataset from the Indonesia Demographic and Health Survey of 2012, the proposed method successfully estimates the demographic, health, and birth-related factors associated with infant mortality. Most importantly, it improves on the results of complete case analyses by both correcting the magnitude of the effect sizes and increasing the power of the analysis to detect significant variables.
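The general idea of combining data augmentation, an EM iteration, and survey weights can be illustrated with a toy sketch. The code below is not the algorithm developed in the thesis. It is a minimal example under strong simplifying assumptions that are not taken from the source: a single binary covariate `x` that may be missing, a binary outcome `y`, missingness treated as ignorable, and a saturated model with a closed-form M-step. The function name `em_augmented`, the record layout `(y, x, w)`, and the starting values are all hypothetical. Each record with a missing `x` is expanded into one row per possible value of `x`, weighted by the survey weight times the posterior probability of that value, and the parameters are then re-estimated as weighted proportions.

```python
def em_augmented(records, n_iter=200):
    """Toy weighted EM for a binary covariate x with missing values.

    records: list of (y, x, w) with y in {0, 1}, x in {0, 1} or None
    (missing), and w > 0 a survey weight. Returns (pi, p), where
    pi estimates P(x = 1) and p[k] estimates P(y = 1 | x = k).
    Assumes ignorable missingness; this is an illustration only.
    """
    # Arbitrary starting values (an assumption of this sketch).
    pi = 0.5
    p = [0.4, 0.6]
    for _ in range(n_iter):
        # E-step: build the augmented data set.
        augmented = []
        for y, x, w in records:
            if x is not None:
                # Fully observed record: keep it with its survey weight.
                augmented.append((y, x, w))
                continue
            # Joint probability of (y, x = k) under the current parameters.
            joint = []
            for k in (0, 1):
                px = pi if k == 1 else 1.0 - pi
                py = p[k] if y == 1 else 1.0 - p[k]
                joint.append(px * py)
            total = joint[0] + joint[1]
            # One row per candidate value of x, weighted by the
            # survey weight times the posterior probability of x = k.
            for k in (0, 1):
                augmented.append((y, k, w * joint[k] / total))
        # M-step: weighted proportions on the augmented data.
        w_x = [0.0, 0.0]   # total weight with x = k
        w_xy = [0.0, 0.0]  # total weight with x = k and y = 1
        for y, x, w in augmented:
            w_x[x] += w
            w_xy[x] += w * y
        pi = w_x[1] / (w_x[0] + w_x[1])
        p = [w_xy[k] / w_x[k] for k in (0, 1)]
    return pi, p


# Hypothetical records: (outcome, covariate or None, survey weight).
complete = [(1, 1, 2.0), (0, 1, 1.0), (1, 0, 1.0), (0, 0, 3.0)]
with_missing = complete + [(1, None, 1.0), (0, None, 2.0)]
pi_hat, p_hat = em_augmented(with_missing)
```

With no missing values the procedure reduces to ordinary survey-weighted proportions. A non-ignorable mechanism of the kind the thesis addresses would additionally require a model for the missingness indicator inside the E-step weights, which this sketch omits.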