Statistics and Its Interface

Volume 9 (2016)

Number 4

Special Issue on Statistical and Computational Theory and Methodology for Big Data

Guest Editors: Ming-Hui Chen (University of Connecticut); Radu V. Craiu (University of Toronto); Faming Liang (University of Florida); and Chuanhai Liu (Purdue University)

BIG-SIR: a Sliced Inverse Regression approach for massive data

Pages: 509 – 520

DOI: http://dx.doi.org/10.4310/SII.2016.v9.n4.a10

Authors

Benoit Liquet (Laboratoire de Mathématiques et de leurs Applications, Université de Pau et des Pays de l’Adour, Pau, France; and ARC Centre of Excellence for Mathematical and Statistical Frontiers, Queensland University of Technology (QUT), Brisbane, Australia)

Jerome Saracco (Bordeaux INP, Inria Bordeaux Sud Ouest, Talence, France)

Abstract

In a massive data setting, we focus on a semiparametric regression model involving a real dependent variable $Y$ and a $p$-dimensional covariate $X$ (with $p \geq 1$). This model includes a dimension reduction of $X$ via an index $X^{\prime} \beta$. Because of the large volume of data, the Effective Dimension Reduction (EDR) direction $\beta$ cannot be estimated directly by the Sliced Inverse Regression (SIR) method. To address the two main challenges of analyzing massive data sets, namely storage and computational efficiency, we propose a new SIR estimator of the EDR direction that follows the “divide and conquer” strategy. The data are divided into subsets, and an EDR direction is estimated in each subset, which is a small data set. The recombination step is based on the optimization of a criterion that assesses the proximity between the EDR directions of the subsets. The subset computations run in parallel with no communication among them.
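The divide, estimate and recombine steps described above can be sketched in a few lines. The sketch below is in Python with NumPy, not the authors' $\mathsf{R}$ code, and it does not reproduce the $\texttt{edrGraphicalTools}$ implementation. The function names (`sir_direction`, `combine_directions`, `big_sir`), the simulated cubic model, and the block and slice counts are all hypothetical. The recombination criterion is also an assumption: the abstract only says that the criterion assesses proximity between subset directions, and here that is taken to mean maximizing the size-weighted sum of squared cosines with the subset directions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def sir_direction(x, y, n_slices=10):
    """Classical SIR estimate of a single EDR direction on one block."""
    n, p = x.shape
    x_centered = x - x.mean(axis=0)
    sigma = x_centered.T @ x_centered / n            # covariance of X
    order = np.argsort(y)                            # slice on the ordered response
    m = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        slice_mean = x_centered[idx].mean(axis=0)
        m += (len(idx) / n) * np.outer(slice_mean, slice_mean)
    # Solve M b = lambda * Sigma b through a Cholesky factor, Sigma = L L'.
    chol = np.linalg.cholesky(sigma)
    a = np.linalg.solve(chol, np.linalg.solve(chol, m).T)   # L^{-1} M L^{-T}
    _, eigvecs = np.linalg.eigh(a)                   # eigenvalues in ascending order
    b = np.linalg.solve(chol.T, eigvecs[:, -1])      # back-transform leading vector
    return b / np.linalg.norm(b)


def combine_directions(directions, sizes):
    """Assumed criterion: maximize sum_g n_g * cos^2(b, b_g) over unit vectors b."""
    weighted = sum(n_g * np.outer(b, b) for b, n_g in zip(directions, sizes))
    _, eigvecs = np.linalg.eigh(weighted)
    return eigvecs[:, -1]                            # leading eigenvector


def big_sir(x, y, n_blocks=8, n_slices=10):
    """Split the data, run SIR independently on each block, then recombine."""
    blocks = list(zip(np.array_split(x, n_blocks), np.array_split(y, n_blocks)))
    # Each block is processed on its own; no worker needs another worker's data.
    with ThreadPoolExecutor() as pool:
        directions = list(pool.map(lambda blk: sir_direction(*blk, n_slices), blocks))
    sizes = [len(y_block) for _, y_block in blocks]
    return combine_directions(directions, sizes)


# Simulated single-index model (illustrative only): Y = (X'beta)^3 + noise.
rng = np.random.default_rng(0)
n, p = 40_000, 5
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
beta /= np.linalg.norm(beta)
x = rng.standard_normal((n, p))
y = (x @ beta) ** 3 + 0.5 * rng.standard_normal(n)

beta_hat = big_sir(x, y)
cos2 = float((beta_hat @ beta) ** 2)
print(f"squared cosine with the true direction: {cos2:.3f}")
```

Because an EDR direction is identified only up to sign and scale, the sketch compares directions through outer products and squared cosines rather than raw coordinates. A squared cosine close to 1 indicates that the recombined direction is nearly collinear with the simulated $\beta$.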

The consistency of our estimator is established and its asymptotic distribution is given. Extensions to multiple-index models, a $q$-dimensional response variable, and/or $\mathrm{SIR}_{\alpha}$-based methods are also discussed. A simulation study using our $\texttt{edrGraphicalTools}$ $\mathsf{R}$ package shows that our approach reduces computation time and overcomes the memory constraints posed by massive data sets. A combination of the $\texttt{foreach}$ and $\texttt{bigmemory}$ $\mathsf{R}$ packages is exploited to make execution efficient in both speed and memory. Results are visualized using the bin-summarise-smooth approach through the $\texttt{bigvis}$ $\mathsf{R}$ package. Finally, we illustrate our proposed approach on a massive airline data set.

Keywords

high performance computing, Effective Dimension Reduction (EDR), parallel programming, $\mathsf{R}$ software, Sliced Inverse Regression (SIR)
