Representative Approach for Big Data Dimension Reduction with Binary Responses
Thesis posted on 01.05.2020, 00:00 by Xuelong Wang
Sufficient dimension reduction (SDR) reduces the dimensionality of data without specifying a regression model. Since it was first introduced by Li (1991), SDR has become popular, and many SDR methods have been proposed and studied. Among those methods, we focus on Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), which are inverse-moment-based methods. These methods work well with continuous responses but not with binary responses, because a binary response has only two levels and therefore allows at most two slices. To address this issue, Shin et al. (2014) proposed an SDR solution for binary data called Probability Enhanced SDR (PRE-SDR). PRE-SDR works well on data with binary responses, but it is computationally intensive and becomes time-consuming when the dataset is large, e.g., N > 10000. In this thesis, motivated by the existing solution and its limitation on large data, we investigate and improve SIR and SAVE from two perspectives. Firstly, we incorporate an online algorithm, which reduces computer memory usage when a dataset is large. The general idea is to scan the data chunk by chunk, calculate intermediate statistics for each chunk, and combine the intermediate results to obtain the final result. We develop online algorithms for SIR and SAVE and show that the online result is identical to the result computed from the full data at once. In addition, we enhance these algorithms with a parallel computation framework so that they can process multiple chunks at the same time. Simulation results suggest that the online algorithm reduces computation time by a factor of 3 to 5 compared with the original methods. Secondly, we propose a novel SDR approach for binary responses, named the Mean Representative approach (MRDR). The main idea is to partition the data into blocks, calculate a representative for each block, and use the representatives as a new dataset for the subsequent SDR analysis. 
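The partition-and-represent idea above can be illustrated with a minimal sketch. It assumes that the representative of a block is the plain average of its predictors and of its binary responses, and it takes the block assignment as a given input. The function name `block_representatives` and the round-robin block assignment in the demo are hypothetical placeholders; the abstract does not state how the thesis forms its blocks.

```python
import numpy as np

def block_representatives(X, y, block_ids):
    """Average predictors and binary responses within each block.

    Assumption: the representative of a block is its mean. The rule used
    to form the blocks is NOT specified here; block_ids is supplied by the
    caller.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    block_ids = np.asarray(block_ids).ravel()

    # Map arbitrary block labels to 0..B-1.
    _, inverse = np.unique(block_ids, return_inverse=True)
    counts = np.bincount(inverse)

    # Per-block sums of the predictors, then divide by block sizes.
    X_rep = np.zeros((counts.size, X.shape[1]))
    np.add.at(X_rep, inverse, X)
    X_rep /= counts[:, None]

    # Per-block mean of a 0/1 response is a proportion in [0, 1].
    y_rep = np.bincount(inverse, weights=y) / counts
    return X_rep, y_rep, counts

# Illustrative data only: a logistic model in the first predictor.
rng = np.random.default_rng(0)
n, p, n_blocks = 10_000, 5, 100
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))

# Placeholder partition (round-robin by row index), not the thesis's rule.
block_ids = np.arange(n) % n_blocks

X_rep, y_rep, counts = block_representatives(X, y, block_ids)
print(X_rep.shape, y_rep.min(), y_rep.max())
```

The reduced dataset `(X_rep, y_rep)` has one row per block, and `y_rep` takes values between 0 and 1 rather than only 0 or 1, so a slicing-based method could in principle use more than two slices on it.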
By converting a block of data points into a single representative data point, the corresponding binary responses become continuous, and the size of the data is reduced significantly because the number of blocks is much smaller than the number of original observations. Therefore, the proposed representative approach provides an ideal solution for dimension reduction on large data and can be combined naturally with the classical SDR approaches. The details of MRDR are introduced and discussed in Chapters 1 and 3. We study the asymptotic properties of MRDR in Chapter 4 and show that the proposed approach can recover the central subspace better than SIR and SAVE. We also discuss the optimal choice of the number of blocks in Section 4.3. The simulation studies in Chapter 5 verify the advantage of the proposed method over the original SIR and SAVE in estimating the central subspace and demonstrate its time efficiency compared with PRE-SIR. Finally, we apply the proposed method to the Electrical Grid Stability (EGS) data and to simulated data based on the EGS data. The results show the advantage of the proposed method over several existing methods for sufficient dimension reduction with large data.
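The chunk-by-chunk online computation described earlier for SIR can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the class `OnlineSIR` and its method names are hypothetical, each observation is assumed to arrive with an integer slice label already attached (for a binary response, the label is simply the response itself), and the directions are obtained from the standard SIR kernel via a Cholesky-whitened eigenproblem. The `merge` method shows how accumulators built on separate chunks could be combined, which is the property a parallel version would rely on.

```python
import numpy as np

class OnlineSIR:
    """Accumulate the sufficient statistics of SIR one chunk at a time."""

    def __init__(self, n_features, n_slices):
        self.n = 0
        self.sum_x = np.zeros(n_features)                   # sum of x
        self.sum_xx = np.zeros((n_features, n_features))    # sum of x x^T
        self.slice_n = np.zeros(n_slices)                   # slice counts
        self.slice_sum = np.zeros((n_slices, n_features))   # per-slice sum of x

    def update(self, X_chunk, slice_chunk):
        X_chunk = np.asarray(X_chunk, dtype=float)
        slice_chunk = np.asarray(slice_chunk, dtype=int)
        self.n += X_chunk.shape[0]
        self.sum_x += X_chunk.sum(axis=0)
        self.sum_xx += X_chunk.T @ X_chunk
        self.slice_n += np.bincount(slice_chunk, minlength=self.slice_n.size)
        np.add.at(self.slice_sum, slice_chunk, X_chunk)

    def merge(self, other):
        # Statistics are plain sums, so accumulators from different
        # chunks (or workers) combine by addition.
        self.n += other.n
        self.sum_x += other.sum_x
        self.sum_xx += other.sum_xx
        self.slice_n += other.slice_n
        self.slice_sum += other.slice_sum
        return self

    def directions(self, n_directions):
        mu = self.sum_x / self.n
        cov = self.sum_xx / self.n - np.outer(mu, mu)
        used = self.slice_n > 0
        centered = self.slice_sum[used] / self.slice_n[used, None] - mu
        weights = self.slice_n[used] / self.n
        # SIR kernel: weighted covariance of the slice means.
        kernel = (centered * weights[:, None]).T @ centered
        # Whiten with cov = L L^T, then solve a symmetric eigenproblem.
        L = np.linalg.cholesky(cov)
        A = np.linalg.solve(L, np.linalg.solve(L, kernel).T)
        _, evecs = np.linalg.eigh((A + A.T) / 2.0)
        top = evecs[:, ::-1][:, :n_directions]   # largest eigenvalues first
        beta = np.linalg.solve(L.T, top)         # back to the original scale
        return beta / np.linalg.norm(beta, axis=0)

# Illustrative data only: binary response, so two slices (y = 0 and y = 1).
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * X[:, 0])))

sir = OnlineSIR(n_features=4, n_slices=2)
for start in range(0, X.shape[0], 500):          # scan in chunks of 500 rows
    sir.update(X[start:start + 500], y[start:start + 500])
print(sir.directions(1).ravel())
```

Because every accumulated quantity is a sum over observations, processing the data in chunks yields the same statistics as a single pass over the full data, up to floating-point rounding. With only two slices the kernel has rank at most one, which illustrates why plain SIR is limited for binary responses.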