A k-means cluster analysis is an iterative procedure that assigns data points to k different clusters in such a way that each point ends up closer to the centroid of its own cluster than to the centroid of any other cluster. However, whenever cluster analysis is performed, one question that must be answered is “How many clusters should be used?” (in other words, what should the value of k be?). An answer to this question is called a stopping rule. Unfortunately, no single stopping rule has been agreed upon (Aldenderfer & Blashfield, 1984; Everitt, 1980). As a result, many different methods for determining k have been proposed (Milligan & Cooper, 1985).
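The iterative procedure described above alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members. A minimal Python sketch using NumPy is given below; the deterministic farthest-point initialisation is one illustrative choice (random restarts and k-means++ are common alternatives), not part of the k-means definition itself:

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then move each centroid to the mean of its members, until stable."""
    # Deterministic farthest-point initialisation: start from the first
    # point, then repeatedly add the point farthest from all chosen centroids.
    idx = [0]
    for _ in range(k - 1):
        d = np.linalg.norm(points[:, None] - points[idx][None], axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    centroids = points[idx].copy()
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(axis=1)
        # Update step: recompute each centroid as the mean of its members;
        # keep the old centroid if a cluster happens to be empty.
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

On two well-separated groups of points and k = 2, the procedure converges to one label per group within a few iterations.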
While no general method exists for determining the optimal number of clusters, one can take advantage of heuristic information about the domain being clustered to arrive at a satisfactory stopping rule for that domain. Dawson et al. (2000) argued that when the hidden unit activities of a trained network are being clustered, there must be a correct mapping from these activities to output responses, because the trained network has itself discovered one such mapping. They used this position to create the following stopping rule: extract the smallest number of clusters such that every hidden unit activity vector assigned to the same cluster produces the same output response in the network.
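This stopping rule can be sketched as a loop over candidate values of k: cluster the hidden unit activity vectors, and stop at the smallest k whose clusters are all “output homogeneous.” The sketch below assumes activity vectors stored as rows of a NumPy array and hashable output responses; the function names and the bundled k-means helper (with a deterministic farthest-point initialisation) are illustrative choices, not taken from the original paper:

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain k-means with a deterministic farthest-point initialisation."""
    idx = [0]
    for _ in range(k - 1):
        d = np.linalg.norm(points[:, None] - points[idx][None], axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    centroids = points[idx].copy()
    for _ in range(n_iter):
        labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

def smallest_output_consistent_k(activities, responses):
    """Return the smallest k whose clusters are output homogeneous: every
    hidden unit activity vector assigned to the same cluster produces the
    same network output response (the stopping rule described above)."""
    for k in range(1, len(activities) + 1):
        labels = kmeans(activities, k)
        if all(len({responses[i] for i in np.flatnonzero(labels == j)}) <= 1
               for j in range(k)):
            return k
```

For example, with three tight groups of activity vectors where two groups happen to produce different output responses from the third, the rule keeps increasing k until each cluster maps onto a single response.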
References:
- Aldenderfer, M. S., & Blashfield, R. K. (1984). Cluster Analysis (Vol. 07-044). Beverly Hills, CA: Sage Publications.
- Dawson, M. R. W., Medler, D. A., McCaughan, D. B., Willson, L., & Carbonaro, M. (2000). Using extra output learning to insert a symbolic theory into a connectionist network. Minds and Machines, 10, 171-201.
- Everitt, B. (1980). Cluster Analysis. New York: Halsted.
- Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.
(Added April 2011)