Statistics and Its Interface
Volume 14 (2021)
Online multiple learning with working sufficient statistics for generalized linear models in big data
Pages: 403 – 416
The article proposes an online multiple learning approach to generalized linear models (GLMs) in big data. The approach relies on a new concept called working sufficient statistics (WSS), formulated under traditional iteratively reweighted least squares (IRWLS) for maximum likelihood estimation of GLMs. Because traditional IRWLS must access the entire data set multiple times, it cannot be applied directly to big data. To overcome this difficulty, a new approach, called one-step IRWLS, is proposed within the online setting. The work investigates two methods. The first uses only the current data to formulate the objective function. The second also uses information from the previous data. Simulation studies show that the results given by the second method can be as precise and accurate as those given by exact maximum likelihood. A useful property is that one-step IRWLS avoids the memory and computational efficiency barriers caused by the volume of big data. Because the size of the WSS does not vary with the sample size, the proposed approach can be used even if the data are much larger than the memory of the computing system.
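The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not define the WSS explicitly. The sketch assumes a logistic GLM and assumes the working sufficient statistics are the accumulated IRWLS cross-products X'WX and X'Wz, whose size depends only on the number of coefficients. The function names one_step_irwls_logistic and simulate_chunks are hypothetical. Each chunk is read once, contributes one IRWLS step evaluated at the current estimate, and is then discarded, which loosely corresponds to the second method (using information from previous data).

```python
import numpy as np


def one_step_irwls_logistic(chunks, n_features):
    """Streaming logistic regression with one IRWLS step per chunk.

    Assumption (not from the paper): the retained summaries are
    A = sum of X'WX and b = sum of X'Wz, accumulated over chunks.
    """
    A = np.zeros((n_features, n_features))  # accumulated X'WX
    b = np.zeros(n_features)                # accumulated X'Wz
    beta = np.zeros(n_features)             # current coefficient estimate
    for X, y in chunks:
        eta = X @ beta                       # linear predictor at current beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        w = np.clip(mu * (1.0 - mu), 1e-8, None)  # IRWLS weights, kept positive
        z = eta + (y - mu) / w               # working response
        A += X.T @ (w[:, None] * X)          # fold this chunk into the summaries
        b += X.T @ (w * z)
        beta = np.linalg.solve(A, b)         # weighted least squares update
    return beta, A, b


def simulate_chunks(beta_true, n_chunks, chunk_size, seed=0):
    """Yield simulated (X, y) chunks from a logistic model (illustration only)."""
    rng = np.random.default_rng(seed)
    p = len(beta_true)
    for _ in range(n_chunks):
        covariates = rng.normal(size=(chunk_size, p - 1))
        X = np.column_stack([np.ones(chunk_size), covariates])
        prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
        y = rng.binomial(1, prob).astype(float)
        yield X, y


if __name__ == "__main__":
    beta_true = np.array([-0.5, 1.0, -1.0])
    beta_hat, A, b = one_step_irwls_logistic(
        simulate_chunks(beta_true, n_chunks=50, chunk_size=2000), n_features=3
    )
    print("estimate:", beta_hat)
```

In this sketch the memory footprint is a p-by-p matrix and a length-p vector regardless of how many chunks are processed, which mirrors the abstract's point that the size of the WSS does not grow with the sample size.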
Keywords
big data, generalized linear models, one-step IRWLS, online multiple learning, parallel computation, working sufficient statistics
2010 Mathematics Subject Classification
Primary 62F10, 62J12. Secondary 62E20.
Received 6 July 2019
Accepted 18 December 2020
Published 8 July 2021