Communications in Information and Systems

Volume 2 (2002)

Number 1

Data reduction via adaptive sampling

Pages: 53 – 68

DOI: https://dx.doi.org/10.4310/CIS.2002.v2.n1.a3

Author

Xiao-Bai Li (School of Management, University of Texas at Dallas, Richardson, TX, U.S.A.)

Abstract

Data reduction is an important issue in the field of data mining. This article describes a new method for selecting a subset of data from a large dataset. A simplified chi-square criterion is proposed for measuring the goodness-of-fit between the distributions of the reduced and full datasets. Under this criterion, the data reduction problem can be formulated as a binary quadratic program, and a tabu search technique is used in the search/optimization process. The procedure is adaptive in that it involves not only random sampling but also deterministic search guided by the results of the previous search steps. The method is applicable primarily to discrete data, but can be extended to continuous data as well. An experimental study compares the proposed method with simple random sampling on a number of simulated and real-world datasets. The results indicate that the distributions of the samples produced by the proposed method are significantly closer to the true distribution than those of random samples.
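The idea in the abstract can be illustrated with a minimal sketch. It is not the article's algorithm: the abstract gives neither the simplified chi-square criterion nor the tabu search details, so the code below assumes a standard Pearson chi-square summed over single-attribute marginals and a basic swap-based tabu search. The function names chi_square and tabu_sample and the parameters iters, candidates, and tenure are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, deque


def chi_square(data, sample_idx):
    """Pearson chi-square between sample and full-data marginals.

    Assumption: the statistic is summed over every attribute's marginal
    distribution; the article's simplified criterion may differ.
    """
    n, N = len(sample_idx), len(data)
    total = 0.0
    for j in range(len(data[0])):
        full = Counter(row[j] for row in data)
        samp = Counter(data[i][j] for i in sample_idx)
        for value, count in full.items():
            expected = n * count / N  # count expected in a perfectly representative sample
            total += (samp.get(value, 0) - expected) ** 2 / expected
    return total


def tabu_sample(data, n, iters=200, candidates=20, tenure=10, seed=0):
    """Start from a simple random sample, then improve it with swap moves.

    Each step evaluates a few random swaps (one record out, one record in),
    applies the best non-tabu swap even if it worsens the score, and keeps
    recently moved records tabu for a short time to avoid cycling.
    """
    rng = random.Random(seed)
    current = set(rng.sample(range(len(data)), n))
    start_score = chi_square(data, current)
    best, best_score = set(current), start_score
    tabu = deque(maxlen=tenure)  # oldest entries drop out automatically
    for _ in range(iters):
        inside = sorted(current)
        outside = [i for i in range(len(data)) if i not in current]
        moves = []
        for _ in range(candidates):
            out_i, in_i = rng.choice(inside), rng.choice(outside)
            if out_i in tabu or in_i in tabu:
                continue
            trial = (current - {out_i}) | {in_i}
            moves.append((chi_square(data, trial), out_i, in_i))
        if not moves:
            continue
        score, out_i, in_i = min(moves)
        current = (current - {out_i}) | {in_i}
        tabu.extend((out_i, in_i))
        if score < best_score:
            best, best_score = set(current), score
    return sorted(best), start_score, best_score


if __name__ == "__main__":
    # Synthetic discrete data with two categorical attributes (illustrative only).
    rng = random.Random(1)
    data = [(rng.choice("abc"), rng.choice("xy")) for _ in range(300)]
    sample, random_score, tuned_score = tabu_sample(data, n=30)
    print("chi-square of initial random sample:", round(random_score, 3))
    print("chi-square after tabu search:", round(tuned_score, 3))
```

Because the search starts from a simple random sample and records the best subset seen, the returned score can never be worse than that of the starting sample. How large the improvement is depends on the data and on the assumed parameters.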

Keywords

data reduction, data mining, chi-square, goodness-of-fit, tabu search, binary quadratic programming

Published 1 January 2002