IDL Analyst Reference Guide: Multivariate Analysis

IMSL_K_MEANS

The IMSL_K_MEANS function performs a K-means (centroid) cluster analysis.


Note
This routine requires an IDL Analyst license. For more information, contact your ITT Visual Information Solutions sales or technical support representative.

Syntax

Result = IMSL_K_MEANS(x, seeds [, COUNTS_CLUSTER=variable]
[, /DOUBLE] [, FREQUENCIES=array] [, ITMAX=value] [, MEANS_CLUSTER=variable] [, SSQ_CLUSTER=variable] [, VAR_COLUMNS=array] [, WEIGHTS=array] )

Return Value

One-dimensional array containing the cluster membership of each observation. In the example below, clusters are numbered starting at 1.

Arguments

seeds

Two-dimensional array containing the cluster seeds, i.e., estimates for the cluster centers. The seed value for the j-th variable of the i-th seed should be in seeds(i, j).

x

Two-dimensional array containing observations to be clustered. The data value for the i-th observation of the j-th variable should be in x(i, j).

Keywords

COUNTS_CLUSTER

Named variable into which an array containing the number of observations in each cluster is stored.

DOUBLE

If present and nonzero, double precision is used.

FREQUENCIES

One-dimensional array containing the frequency of each observation of matrix x. Default: Frequencies(*) = 1

ITMAX

Maximum number of iterations. Default: Itmax = 30

MEANS_CLUSTER

Named variable into which a two-dimensional array containing the cluster means is stored.

SSQ_CLUSTER

Named variable into which a one-dimensional array containing the within sum-of-squares for each cluster is stored.

VAR_COLUMNS

One-dimensional array containing the columns of x to be used in computing the metric. Columns are numbered 0, 1, 2, ..., N_ELEMENTS(x(0, *)) - 1. Default: Var_Columns(*) = 0, 1, 2, ..., N_ELEMENTS(x(0, *)) - 1

WEIGHTS

One-dimensional array containing the weight of each observation of matrix x. Default: Weights(*) = 1
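
The keywords above can be combined in a single call. The following is a minimal sketch that reuses the data and seeds from the Example section. The values chosen for ITMAX, FREQUENCIES, and WEIGHTS are illustrative assumptions, not recommendations, and the variable name n_obs is introduced here for illustration only.

```idl
; Set up data and seeds as in the Example section.
x = IMSL_STATDATA(3)
seeds = MAKE_ARRAY(3, 4)
seeds(0, *) = x(0, 1:4)
seeds(1, *) = x(50, 1:4)
seeds(2, *) = x(100, 1:4)
n_obs = N_ELEMENTS(x(*, 0))
; Illustrative values only: unit weights, first observation counted twice.
weights = REPLICATE(1.0, n_obs)
frequencies = REPLICATE(1.0, n_obs)
frequencies(0) = 2.0
; Raise the iteration limit above the default of 30 and request double precision.
cluster_group = IMSL_K_MEANS(x(*, 1:4), seeds, ITMAX = 50, $
   FREQUENCIES = frequencies, WEIGHTS = weights, /DOUBLE, $
   COUNTS_CLUSTER = counts_cluster)
PRINT, counts_cluster
```

FREQUENCIES and WEIGHTS each need one element per observation (row) of x.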

Discussion

The IMSL_K_MEANS function is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). This function computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K-cluster means. The IMSL_K_MEANS function allows for missing values coded as NaN (Not a Number) and for weights and frequencies.

Let p = N_ELEMENTS(x(0, *)) be the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums-of-squares. In this case, the sum of squares within cluster i is computed as the sum of the centered sum-of-squares over all non-missing values of each variable. That is:

$$\phi_i = \sum_{j=1}^{p} \sum_{m=1}^{n_i} f_{\nu_{im}} \, w_{\nu_{im}} \, \delta_{\nu_{im},j} \left( x_{\nu_{im},j} - \bar{x}_{ij} \right)^2$$

where $\nu_{im}$ denotes the row index of the m-th observation in the i-th cluster in the matrix X; $n_i$ is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; $\delta$ is 0 if the j-th variable on observation $\nu_{im}$ is missing, otherwise $\delta$ is 1; and:

$$\bar{x}_{ij}$$

is the average of the non-missing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sums-of-squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.
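
The within-cluster sum of squares described above can be recomputed from the data and the returned membership as a consistency check. The following is a minimal sketch, not the IMSL implementation. WITHIN_SSQ is a hypothetical helper (not part of IDL Analyst) that would be saved as within_ssq.pro, and it assumes the default frequencies and weights of 1.

```idl
; Hypothetical helper, not part of IDL Analyst.
; x: n_obs by p data array; cluster_group: membership of each observation;
; k: cluster number. Assumes all frequencies and weights equal 1.
FUNCTION within_ssq, x, cluster_group, k
   dims = SIZE(x, /DIMENSIONS)
   p = dims[1]
   rows = WHERE(cluster_group EQ k, n_k)
   IF n_k EQ 0 THEN RETURN, 0.0
   ssq = 0.0
   FOR j = 0, p - 1 DO BEGIN
      xj = REFORM(x[rows, j])
      ; Skip missing values (NaN) for this variable.
      ok = WHERE(FINITE(xj), n_ok)
      IF n_ok GT 0 THEN BEGIN
         mean_j = TOTAL(xj[ok]) / n_ok
         ssq = ssq + TOTAL((xj[ok] - mean_j)^2)
      ENDIF
   ENDFOR
   RETURN, ssq
END
```

A small made-up data set with three observations, two variables, and one missing value illustrates the call:

```idl
x = [[1.0, 2.0, 9.0], [1.0, !VALUES.F_NAN, 5.0]]
membership = [1, 1, 2]
; Cluster 1: variable 0 contributes 0.5; variable 1 has one non-missing value and contributes 0.
PRINT, within_ssq(x, membership, 1)
```

Under the same assumptions, calling the helper on the example data for each cluster should give values close to those returned in SSQ_CLUSTER.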

Example

This example performs K-means cluster analysis on Fisher's iris data, which is obtained by IMSL_STATDATA. The initial cluster seed for each iris type is an observation known to be in the iris type.

seeds = MAKE_ARRAY(3, 4)
x = IMSL_STATDATA(3)
; Seed each cluster with one observation known to be of that iris type.
seeds(0, *) = x(0, 1:4)
seeds(1, *) = x(50, 1:4)
seeds(2, *) = x(100, 1:4)
; Use columns 1, 2, 3, and 4 of data matrix x only.
cluster_group = IMSL_K_MEANS(x(*, 1:4), seeds, $
   Means_Cluster = means_cluster, Ssq_Cluster = ssq_cluster, $
   Counts_Cluster = counts_cluster)
; Print cluster membership in groups of 10.
format = '(a, 10i4)'
FOR i = 0, 140, 10 DO BEGIN &$
   PRINT, 'observation: ', i + INDGEN(10) + 1, $
      FORMAT = format &$
   PRINT, 'cluster: ', cluster_group(i:i+9), $
      FORMAT = format &$
   PRINT &$
ENDFOR
  
observation:   1   2   3   4   5   6   7   8   9  10
cluster:       1   1   1   1   1   1   1   1   1   1
observation:  11  12  13  14  15  16  17  18  19  20
cluster:       1   1   1   1   1   1   1   1   1   1
observation:  21  22  23  24  25  26  27  28  29  30
cluster:       1   1   1   1   1   1   1   1   1   1
observation:  31  32  33  34  35  36  37  38  39  40
cluster:       1   1   1   1   1   1   1   1   1   1
observation:  41  42  43  44  45  46  47  48  49  50
cluster:       1   1   1   1   1   1   1   1   1   1
observation:  51  52  53  54  55  56  57  58  59  60
cluster:       2   2   3   2   2   2   2   2   2   2
observation:  61  62  63  64  65  66  67  68  69  70
cluster:       2   2   2   2   2   2   2   2   2   2
observation:  71  72  73  74  75  76  77  78  79  80
cluster:       2   2   2   2   2   2   2   3   2   2
observation:  81  82  83  84  85  86  87  88  89  90
cluster:       2   2   2   2   2   2   2   2   2   2
observation:  91  92  93  94  95  96  97  98  99 100
cluster:       2   2   2   2   2   2   2   2   2   2
observation: 101 102 103 104 105 106 107 108 109 110
cluster:       3   2   3   3   3   3   2   3   3   3
observation: 111 112 113 114 115 116 117 118 119 120
cluster:       3   3   3   2   2   3   3   3   3   2
observation: 121 122 123 124 125 126 127 128 129 130
cluster:       3   2   3   2   3   3   2   2   3   3
observation: 131 132 133 134 135 136 137 138 139 140
cluster:       3   3   3   2   3   3   3   3   2   3
observation: 141 142 143 144 145 146 147 148 149 150
cluster:       3   3   2   3   3   3   2   3   3   2
  
PM, [[INDGEN(3) + 1],[means_cluster]], Title = 'Cluster Means:',$  
   FORMAT = '(i3, 5x, 4f8.4)'  
  
Cluster Means:  
   1       5.0060  3.4280  1.4620  0.2460  
   2       5.9016  2.7484  4.3935  1.4339  
   3       6.8500  3.0737  5.7421  2.0711  
  
PM, [[INDGEN(3) + 1],[ssq_cluster]], $  
   Title = 'Cluster Sums of Squares:', FORMAT = '(i3, 5x, f8.4)'  
  
Cluster Sums of Squares:  
   1      15.1510  
   2      39.8210   
   3      23.8795  
  
PM, [[INDGEN(3) + 1],[counts_cluster]], Title = $  
   'Number of Observations per Cluster:'  
  
Number of Observations per Cluster:  
   1          50  
   2          62  
   3          38  
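
The counts in COUNTS_CLUSTER can be cross-checked against the membership vector. The following is a minimal sketch using the standard IDL HISTOGRAM and ARRAY_EQUAL functions. It assumes that cluster_group and counts_cluster are still defined from the example above; the variable name tally is introduced here for illustration only.

```idl
; Tally the membership vector; element k-1 holds the count for cluster k.
tally = HISTOGRAM(cluster_group, MIN = 1, MAX = 3, BINSIZE = 1)
PRINT, tally
; Prints 1 if the tally agrees with the counts returned by IMSL_K_MEANS.
PRINT, ARRAY_EQUAL(tally, counts_cluster)
```

For the membership listing shown above, the tally should come to 50, 62, and 38.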

Errors

Warning Errors

STAT_NO_CONVERGENCE—Convergence did not occur.

Version History

6.4
Introduced

  IDL Online Help (March 06, 2007)