We propose a fast and efficient strategy, called the representative approach, for fitting linear models and generalized linear models (GLMs) in big data analysis, and in particular for distributed datasets.
Given a partition of a big dataset, this approach constructs a representative data point for each data block and fits the target model on the resulting representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. As for efficiency, the accuracy of its estimated parameters appears to be better than that of the divide-and-conquer method. Additionally, the representative approach is especially useful for analyzing massive data stored in a distributed fashion on different nodes, since the representatives are generated conditionally independently across blocks. Overall, we recommend two representative approaches, mean representative (MR) and score-matching representative (SMR), along with theoretical justifications, for big data analysis with generalized linear models.
Comprehensive simulation studies confirm that MR is a good solution for linear models and a good pre-analysis tool for GLMs, while SMR outperforms the subsampling and divide-and-conquer methods for general GLMs, even with moderate block sizes. With a properly chosen data partition, the SMR estimate appears to be comparable to the full-data estimate. Using the Airline on-time performance data as an illustrative real big data example, we show that the MR and SMR estimates are as good as the full-data estimate when the latter is available.
For GLMs with flat inverse link functions and moderate coefficients of the continuous variables, we recommend MR. Otherwise, we recommend the SMR solution, using MR with a finer partition as an initial step.
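As an illustration of the mean representative idea described above for linear models, the following is a minimal sketch and not the thesis's implementation. It assumes that the MR of a block is the block mean of the covariates and the response, and that the linear model is fitted to the representatives by least squares weighted by block size. The function names `mean_representatives` and `fit_weighted_ols`, and the quadrant partition in the usage lines, are hypothetical choices made for this example.

```python
import numpy as np

def mean_representatives(X, y, block_ids):
    """Assumed MR construction: one (mean covariates, mean response) point per block."""
    block_ids = np.asarray(block_ids).ravel()
    _, inverse, counts = np.unique(block_ids, return_inverse=True, return_counts=True)
    k = counts.size
    X_sum = np.zeros((k, X.shape[1]))
    y_sum = np.zeros(k)
    # Accumulate per-block sums; each block could be processed on its own node.
    np.add.at(X_sum, inverse, X)
    np.add.at(y_sum, inverse, y)
    return X_sum / counts[:, None], y_sum / counts, counts

def fit_weighted_ols(X_rep, y_rep, weights):
    """Least squares on the representatives, weighted by block size (an assumption)."""
    A = np.column_stack([np.ones(len(y_rep)), X_rep])  # add intercept column
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y_rep * sw, rcond=None)
    return coef

# Usage on simulated data with a hypothetical partition by covariate quadrant.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=10_000)
block_ids = 2 * (X[:, 0] > 0) + (X[:, 1] > 0)
X_rep, y_rep, sizes = mean_representatives(X, y, block_ids)
print(fit_weighted_ols(X_rep, y_rep, sizes))
```

Because averaging preserves linear relationships, the block means of a linear model follow the same regression line, with noise reduced by the block size. This is one way to see why a mean-based representative can work well for linear models, whereas a nonlinear link in a GLM breaks this property.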
Advisor
Yang, Jie
Chair
Yang, Jie
Department
Department of Mathematics, Statistics, and Computer Science