Statistics and Its Interface

Volume 8 (2015)

Number 4

motifDiverge: a model for assessing the statistical significance of gene regulatory motif divergence between two DNA sequences

Pages: 463 – 476

DOI: https://dx.doi.org/10.4310/SII.2015.v8.n4.a6

Authors

Dennis Kostka (Department of Developmental Biology and Department of Computational & Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, U.S.A.)

Tara Friedrich (Integrative Program in Quantitative Biology, Gladstone Institutes, University of California at San Francisco)

Alisha K. Holloway (Division of Biostatistics, Gladstone Institutes, University of California at San Francisco)

Katherine S. Pollard (Institute for Human Genetics, Gladstone Institutes, University of California at San Francisco)

Abstract

Next-generation sequencing technology enables the identification of thousands of gene regulatory sequences in many cell types and organisms. We consider the problem of testing whether two such sequences differ in their number of binding site motifs for a given transcription factor (TF) protein. Binding site motifs impart regulatory function by providing TFs the opportunity to bind to genomic elements and thereby affect the expression of nearby genes. Evolutionary changes to such functional DNA are hypothesized to be major contributors to phenotypic diversity within and between species; but despite the importance of TF motifs for gene expression, no method exists to test for motif loss or gain. Assuming that motif counts are binomially distributed, and allowing for dependencies between motif instances in evolutionarily related sequences, we derive the probability mass function of the difference in motif counts between two nucleotide sequences. We provide a method to numerically estimate this distribution from genomic data and show through simulations that our estimator is accurate. Finally, we introduce the $\mathrm{R}$ package $\mathsf{motifDiverge}$ that implements our methodology and illustrate its application to gene regulatory enhancers identified by a mouse developmental time course experiment. While this study was motivated by analysis of regulatory motifs, our results can be applied to any problem involving two correlated Bernoulli trials.
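
To make the quantity in the abstract concrete, the following Python sketch computes the probability mass function of a difference in counts under a deliberately simplified setting. It is not the motifDiverge API and not necessarily the paper's derivation. It assumes $n$ independent, identically distributed pairs of correlated Bernoulli trials $(X_i, Y_i)$, one pair per aligned position, so that each pair contributes $+1$, $0$, or $-1$ to the difference $D = \sum_i (X_i - Y_i)$. The function names and the parameters `p10` and `p01` are hypothetical and chosen for illustration only.

```python
def diff_pmf(n, p10, p01):
    """PMF of D = sum_i (X_i - Y_i) over n i.i.d. pairs of Bernoulli trials.

    Assumption (illustrative, not taken from the paper):
      p10 = P(X_i = 1, Y_i = 0)  -> the pair contributes +1
      p01 = P(X_i = 0, Y_i = 1)  -> the pair contributes -1
      otherwise (both 0 or both 1) the pair contributes 0.
    Correlation between X_i and Y_i enters only through these joint
    probabilities. Returns a list of length 2n + 1 in which index k
    holds P(D = k - n).
    """
    if n < 0 or p10 < 0 or p01 < 0 or p10 + p01 > 1:
        raise ValueError("need n >= 0, p10 >= 0, p01 >= 0, p10 + p01 <= 1")
    step = [p01, 1.0 - p10 - p01, p10]  # P(-1), P(0), P(+1) for one pair
    pmf = [1.0]                         # zero pairs: D = 0 with certainty
    for _ in range(n):
        # Convolve the current PMF with the single-pair step distribution.
        new = [0.0] * (len(pmf) + 2)
        for i, mass in enumerate(pmf):
            for j, q in enumerate(step):
                new[i + j] += mass * q
        pmf = new
    return pmf


def upper_tail(pmf, n, d_obs):
    """P(D >= d_obs) for a PMF returned by diff_pmf(n, ...)."""
    return sum(pmf[max(d_obs + n, 0):])


if __name__ == "__main__":
    n = 50
    pmf = diff_pmf(n, p10=0.02, p01=0.02)
    # Probability of a net gain of at least 3 motif instances in the first sequence.
    print(upper_tail(pmf, n, 3))
```

Under these assumptions the difference is a sum of independent three-valued steps, so repeated convolution gives the exact distribution. Sequences of unequal length or only partially aligned sequences would need additional terms that this sketch omits.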

Keywords

testing, gene regulation, motif, ChIP-seq, binomial, transcription factor, regulatory evolution

Published 19 October 2015