Statistics and Its Interface

Volume 1 (2008)

Number 1

Partially Bayesian variable selection in classification trees

Pages: 155 – 167

DOI: https://dx.doi.org/10.4310/SII.2008.v1.n1.a13

Authors

Xuming He (Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, U.S.A.)

Douglas A. Noe (Department of Mathematics and Statistics, Miami University, Oxford, Ohio, U.S.A.)

Abstract

Tree-structured models for classification may be split into two broad categories: those that are completely data-driven and those that allow some direct user interaction during model construction. Classifiers such as CART [3] and QUEST [11] are members of the first category. In those data-driven algorithms, all predictor variables compete equally for a particular classification task. However, in many cases a subject-area expert is likely to have some qualitative notion of their relative importance. Interactive algorithms such as RTREE [17] address this issue by allowing users to select variables at various stages of tree construction. In this paper, we introduce a more formal partially Bayesian procedure for dynamically incorporating qualitative expert opinions into the construction of classification trees. An algorithm that dynamically incorporates expert opinion in this way has two potential advantages, each improving with the quality of the expert. First, by de-emphasizing certain subsets of variables during the estimation process, the computational effort can be reduced. Second, by giving an expert’s preferred variables priority, we reduce the chance that a spurious variable will appear in the model. Hence, our resulting models are potentially more interpretable and more stable than those generated by purely data-driven algorithms.
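To make the idea concrete, the following is a minimal, hypothetical sketch (not the authors' actual procedure) of how expert opinion might bias split selection in a tree: each candidate variable's split criterion (here, information gain) is multiplied by an expert-assigned weight, so preferred variables win ties and marginal contests. All function names, the toy data, and the weighting scheme are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, y, threshold):
    """Gain from splitting one numeric column at <= threshold."""
    left = [yi for xi, yi in zip(column, y) if xi <= threshold]
    right = [yi for xi, yi in zip(column, y) if xi > threshold]
    if not left or not right:
        return 0.0
    n = len(y)
    return (entropy(y)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def best_split(X, y, prior_weights):
    """Choose the (score, variable, threshold) maximizing prior-weighted gain.

    prior_weights > 1 emphasize an expert's preferred variables;
    weights < 1 de-emphasize variables the expert distrusts.
    """
    best = None
    for var, column in X.items():
        for t in sorted(set(column))[:-1]:
            score = prior_weights.get(var, 1.0) * information_gain(column, y, t)
            if best is None or score > best[0]:
                best = (score, var, t)
    return best

# Toy data: x1 separates the two classes perfectly; x2 is noise.
X = {"x1": [1, 2, 3, 4], "x2": [5, 1, 4, 2]}
y = [0, 0, 1, 1]
# Expert favors x1 over x2.
print(best_split(X, y, {"x1": 2.0, "x2": 0.5}))
```

In a purely data-driven tree, every variable would effectively carry weight 1.0; here the expert's weights shift the competition before the data are consulted, which is the intuition behind letting a prior de-emphasize spurious predictors.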

Keywords

feature selection, expert opinion, supervised learning

Published 1 January 2008