Statistics and Its Interface
Volume 1 (2008)
Nonparametric clustering of functional data
Pages: 47 – 62
This paper presents a method for effectively detecting unknown patterns or clusters in high dimensional functional data. Examples of such data include gene expression levels measured over time in microarray experiments, functional magnetic resonance imaging (fMRI), and mass spectrometry data from proteomics and lipidomics. We define clusters through the unknown high dimensional multivariate distributions of all observations along each curve. Kullback-Leibler information and Mahalanobis generalized squared distance can fail to provide a meaningful measure of distance between distributions in such high dimensional settings. We propose a new similarity measure and an agglomerative clustering algorithm, called PCLUST, to effectively differentiate among high dimensional populations. The algorithm produces results that are invariant under monotone transformations of the data and does not require users to specify the number of clusters. Simulations show that PCLUST significantly outperforms nine other popular algorithms in both clustering accuracy and robustness. We illustrate the method with an application to identifying biomarkers from time course gene expression data for Arabidopsis responding to environmental stresses.
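Two of the properties mentioned in the abstract can be illustrated with a minimal sketch. It is not the PCLUST algorithm or its similarity measure, neither of which is specified here. The simulated data, the helper `column_ranks`, and the choice of 10 curves at 50 time points are assumptions made only for illustration. The sketch shows that within-time-point ranks do not change under a strictly increasing transformation, and that the sample covariance is singular when there are more time points than curves, so the Mahalanobis distance cannot be computed from it.

```python
import numpy as np

def column_ranks(x):
    """Rank the curves within each column (time point); assumes no ties."""
    order = np.argsort(x, axis=0, kind="stable")
    return np.argsort(order, axis=0, kind="stable")

rng = np.random.default_rng(0)
# Hypothetical data: 10 curves observed at 50 time points,
# so the dimension exceeds the sample size.
curves = rng.normal(size=(10, 50))

# A strictly increasing transformation (here exp) leaves the ranks unchanged,
# so any procedure that depends on the data only through ranks is invariant.
ranks_raw = column_ranks(curves)
ranks_exp = column_ranks(np.exp(curves))
ranks_invariant = bool(np.array_equal(ranks_raw, ranks_exp))

# The 50 x 50 sample covariance has rank at most n - 1 = 9, so it has no
# inverse and the Mahalanobis distance based on it is undefined.
cov = np.cov(curves, rowvar=False)
cov_rank = int(np.linalg.matrix_rank(cov))

print("ranks invariant under exp:", ranks_invariant)
print("covariance rank:", cov_rank, "of", cov.shape[0])
```

Rank-based statistics are one common route to monotone invariance. Whether PCLUST achieves it this way is not stated in the abstract.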
cluster analysis, nonparametric inference, hypothesis testing, mixture model, high dimensional multivariate analysis, time course gene expression microarray data, lipid metabolism
2010 Mathematics Subject Classification
Primary 60H30, 62G10, 62G35. Secondary 62P10.