BOOTSTRAPPING <i>K</i>-MEANS CLUSTERING

Access this Article

Search this Article

Author(s)

Abstract

Independent observations <i>X</i><sub>1</sub>, <i>X</i><sub>2</sub>, …, <i>X</i><sub>n</sub> are made on a distribution <i>F</i> on <i>R</i><sup>d</sup>. To devide these observations into <i>k</i> clusters, first choose a vector of optimal cluster centers <i>b</i><sub>n</sub>=(<i>b</i><sub>n1</sub>, <i>b</i><sub>n2</sub>, …, <i>b</i><sub>nk</sub>) to minimize <i>W</i><sub>n</sub>(<i>a</i>)=1/nΣni=1min<sub>1</sub>≤<i>j</i>≤<i>k</i>||<i>X</i><sub>i</sub>-<i>a</i><sub>j</sub>||<sup>2</sup> as a function of <i>a</i>=(<i>a</i><sub>1</sub>, <i>a</i><sub>2</sub>, …, <i>a</i><sub>k</sub>), then assign each observation to its nearest cluster center. Each <i>b</i><sub>nj</sub> is the mean of observations in its cluster. Pollard (1982) obtained a central limit theorem for the means of the <i>k</i>-clusters. In this paper, it is shown that the bootstrap distribution of the centered <i>b</i><sub>n</sub> has the same limiting distribution; the argument rests on asymptotic behavior of empirical processes on Vapnik-Chervonenkis classes in triangular array setting. Advantages of the bootstrap methods are discussed and the performance of bootstrap confidence sets is compared with Pollard's confidence sets by Monte Carlo simulation. <sup>2</sup>

Journal

  • Journal of the Japanese Society of Computational Statistics

    Journal of the Japanese Society of Computational Statistics 3(1), 1-14, 1990

    Japanese Society of Computational Statistics

Codes

Page Top