Statistics and Its Interface

Volume 3 (2010)

Number 4

Simultaneous set-wise testing under dependence, with applications to genome-wide association studies

Pages: 501 – 511

DOI: https://dx.doi.org/10.4310/SII.2010.v3.n4.a8

Authors

Wenguang Sun (Department of Statistics, North Carolina State University, Raleigh, N.C., U.S.A.)

Wei Wang (Department of Computer Science, New Jersey Institute of Technology, Newark, N.J., U.S.A.)

Zhi Wei (Department of Computer Science, New Jersey Institute of Technology, Newark, N.J., U.S.A.)

Abstract

We consider the problem of identifying disease-associated genomic regions in genome-wide association studies (GWAS). It is shown that conventional single-SNP analysis can be greatly improved by (i) exploiting the spatial dependency and (ii) conducting set-wise analysis. The SNP set association problem can be conceptualized as the problem of simultaneously testing a large number of sets of hypotheses. We use hidden Markov models to exploit the linkage disequilibrium information in GWAS data, based on which a data-driven screening procedure (GLIS) is proposed. GLIS is shown to be optimal in the sense that it has the smallest missed set rate (MSR) among all valid false set rate (FSR) procedures. The numerical results demonstrate that the proposed procedure controls the FSR at the desired level, enjoys certain optimality properties and outperforms conventional combined p-value methods. We apply the GLIS procedure to a Type 1 diabetes (T1D) GWAS dataset to detect T1D-associated genomic regions. The results show that our proposed SNP set analysis not only provides better biological insights, but also increases the statistical power by pooling information from different samples.
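
The abstract does not spell out the GLIS decision rule, so the following is only an illustrative sketch and not the authors' implementation. It assumes that each SNP set already has an estimated posterior probability of being null (for instance, derived from a fitted hidden Markov model) and applies a generic step-up rule: rank the sets by that probability and reject the largest group whose average null probability stays at or below the target level. The names `step_up_reject`, `null_probs` and `alpha` are hypothetical.

```python
def step_up_reject(null_probs, alpha):
    """Generic step-up rule on posterior null probabilities (illustrative only).

    null_probs: assumed per-set posterior probabilities that the set is null.
    alpha: target level for the average null probability among rejected sets.
    Returns the sorted indices of the rejected sets.
    """
    # Rank sets from most to least likely to be non-null.
    order = sorted(range(len(null_probs)), key=lambda i: null_probs[i])

    running_sum = 0.0
    k = 0  # number of sets to reject
    for rank, idx in enumerate(order, start=1):
        running_sum += null_probs[idx]
        # Keep the largest rank whose running average is within alpha.
        if running_sum / rank <= alpha:
            k = rank

    return sorted(order[:k])


# Made-up probabilities for five hypothetical SNP sets.
example_probs = [0.01, 0.9, 0.05, 0.3, 0.02]
rejected = step_up_reject(example_probs, alpha=0.05)
print(rejected)
```

Because the probabilities are sorted in increasing order, the running average never decreases, so the loop stops growing the rejection set as soon as the average exceeds `alpha`.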

Keywords

hidden Markov model, large-scale multiple testing, conjunction test, partial conjunction test, genome-wide association studies

Published 1 January 2010