Statistics and Its Interface
Volume 12 (2019)
Pólya urn model and its application to text categorization
Pages: 227 – 237
The Pólya urn model is a basic model widely applied in statistics and text mining. Most algorithms for training the model are slow and complicated, so it is generally difficult to fit a Pólya urn model to big data sets. This paper proposes a new minorization-maximization (MM) algorithm for the maximum likelihood estimation (MLE) of the Pólya urn model, in which the surrogate function is constructed by means of a simple convex function. The convergence of the MM algorithm is analyzed, and the asymptotic normality of the corresponding MLE for non-identically distributed observations is derived. The performance of the new MM algorithm is compared with Newton's method and with other MM algorithms. The Pólya urn model is then applied to text categorization, where its superiority to the naive Bayes (NB) classifier, k-nearest neighbor (k-NN), and support vector machine (SVM) is demonstrated on a real newsgroup dataset.
Pólya urn model, minorization-maximization, asymptotic properties, text categorization
This research was supported by NSFC under grant No. 71771089.
Received 20 October 2017
Published 11 March 2019
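
To make the estimation problem in the abstract concrete, the following is a minimal sketch, not the new algorithm proposed in the paper (its surrogate function is not given on this page). It assumes the Pólya urn model is parameterized as the Dirichlet-multinomial distribution with a positive parameter vector `alpha`, and it uses the well-known fixed-point update of Minka, which can be derived from lower bounds on the log-likelihood and so serves as one example of an existing MM-style scheme. All function names and the toy count data are hypothetical, and every category is assumed to have at least one positive count.

```python
import math


def log_likelihood(counts, alpha):
    """Dirichlet-multinomial log-likelihood (multinomial coefficient dropped)."""
    total_alpha = sum(alpha)
    ll = 0.0
    for row in counts:
        m = sum(row)
        ll += math.lgamma(total_alpha) - math.lgamma(total_alpha + m)
        for x, a in zip(row, alpha):
            ll += math.lgamma(a + x) - math.lgamma(a)
    return ll


def digamma_diff(a, x):
    """psi(a + x) - psi(a) for a non-negative integer x, written as a finite sum."""
    return sum(1.0 / (a + k) for k in range(x))


def mm_step(counts, alpha):
    """One fixed-point update (Minka's iteration), assumed here as the MM step."""
    total_alpha = sum(alpha)
    denom = sum(digamma_diff(total_alpha, sum(row)) for row in counts)
    new_alpha = []
    for j, a in enumerate(alpha):
        numer = sum(digamma_diff(a, row[j]) for row in counts)
        new_alpha.append(a * numer / denom)
    return new_alpha


def fit(counts, alpha0, max_iter=500, tol=1e-10):
    """Iterate the update until the log-likelihood gain falls below tol."""
    alpha = list(alpha0)
    trace = [log_likelihood(counts, alpha)]
    for _ in range(max_iter):
        alpha = mm_step(counts, alpha)
        trace.append(log_likelihood(counts, alpha))
        if trace[-1] - trace[-2] < tol:
            break
    return alpha, trace


# Toy word-count vectors (made up for illustration): 5 documents, 3 terms.
toy_counts = [[5, 0, 1], [0, 4, 2], [3, 3, 0], [1, 0, 5], [0, 6, 0]]
alpha_hat, ll_trace = fit(toy_counts, [1.0, 1.0, 1.0])
print(alpha_hat, ll_trace[-1])
```

The recorded `ll_trace` can be inspected to check the ascent property that MM algorithms guarantee: each iteration should not decrease the log-likelihood. For text categorization, one such parameter vector would be fitted per class and a new document assigned to the class with the highest likelihood, which is again an assumption about the general approach rather than a description of the paper's experiments.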