Cluster analysis

SPSS offers two separate approaches to cluster analysis: K-means clustering (also called Quick clustering) and Hierarchical (or agglomerative) clustering. Procedures for carrying out both are given below. In 2B7Y, I suggest that you use K-means clustering for your first analyses. Some annotated output from SPSS is available here and will be discussed in more detail in lectures. Hierarchical cluster analysis has its uses, but it is a rather complicated can of worms, and different methods can give very different answers when applied to the same set of data.

Whichever method you use, once you have an analysis that you are happy with, draw a scatter plot using factors 1 and 2 from the PCA, plotting different symbols for the different clusters. Look at the mean abundance of the commoner species in each cluster, and give provisional labels to the clusters based on which species are characteristic of each. What is the relationship between the clusters and the PCA? The results of the two should be broadly consistent with each other: if oak trees have a high positive correlation with PC 1, then clusters in which oak trees are common should be towards the top end of PC 1 on the graph.

If you have environmental data, calculate the mean of each variable for each cluster. If you have collected data on plant communities, you may be interested in putting each of your groups into an NVC category.
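If you export your data with the saved cluster numbers, the per-cluster means can also be worked out outside SPSS. The following is a minimal Python sketch of that calculation. The function name `cluster_means`, the column names and the three example samples are all made up for illustration and are not part of the course data.

```python
from collections import defaultdict


def cluster_means(rows, cluster_key, variables):
    """Return {cluster: {variable: mean}} for the listed variables."""
    sums = defaultdict(lambda: {v: 0.0 for v in variables})
    counts = defaultdict(int)
    for row in rows:
        c = row[cluster_key]
        counts[c] += 1
        for v in variables:
            sums[c][v] += row[v]
    return {c: {v: sums[c][v] / counts[c] for v in variables} for c in sums}


# Hypothetical samples: a saved cluster number, one species abundance
# and one environmental variable per sample.
samples = [
    {"cluster": 1, "oak": 8.0, "soil_ph": 5.0},
    {"cluster": 1, "oak": 6.0, "soil_ph": 6.0},
    {"cluster": 2, "oak": 1.0, "soil_ph": 7.5},
]

means = cluster_means(samples, "cluster", ["oak", "soil_ph"])
for cluster in sorted(means):
    print(cluster, means[cluster])
```

The same function works for species abundances and for environmental variables, since both are simply columns averaged within each cluster.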

K-Means cluster analysis

K-means clustering was originally designed as a method that allowed very large data sets to be clustered in a feasible amount of time, when computers were rather slower than they are today. This explains its other name of "quick clustering". It requires the number of clusters to be specified in advance, and the initial number chosen may split natural groupings or combine two or more groups that are rather different from each other. When used with ecological data, it has the advantage of producing nice discrete groups that are usually easy to interpret. The main disadvantage is that there needs to be a certain amount of trial and error in choosing the number of clusters. A second disadvantage for the mathematically inclined is that the implementation of the procedure in SPSS is restricted to measuring distances between samples using Euclidean distance. This is not a major problem for our needs, but becomes important if you have presence/absence data.
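The text above does not spell out why Euclidean distance matters for presence/absence data. One commonly cited reason is that it treats shared absences as agreement, so two species-poor samples with nothing in common can appear closer than two richer samples that share several species. The Python sketch below illustrates this with four invented samples of six species each; the data are assumptions chosen purely to show the effect.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two samples (equal-length lists)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two species-poor samples with no species in common.
a = [1, 0, 0, 0, 0, 0]
b = [0, 1, 0, 0, 0, 0]

# Two richer samples that share three species.
c = [1, 1, 1, 1, 0, 0]
d = [1, 1, 1, 0, 1, 1]

d_ab = euclidean(a, b)  # the samples differ at two positions
d_cd = euclidean(c, d)  # the samples differ at three positions

print("no shared species:   ", round(d_ab, 3))
print("three shared species:", round(d_cd, 3))
```

Here the pair with no species in common comes out as the more similar pair, which is rarely what an ecologist would want.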

To carry out the analysis, choose Classify>K-means Cluster from the Analyze menu. Copy all of your ecological variables across into the list, and specify the number of clusters that you want it to find. Try 10 clusters for large data sets, such as quadrat surveys of vegetation, and 5 for smaller data sets, such as those on freshwater invertebrates. Click on the Save button and request that cluster memberships are saved as a new variable.

In the printout, the "final cluster centres" table gives the mean abundance of each species in each of the clusters. This will enable you to give a descriptive name to each cluster based on its dominant species. There is also a table of the number of samples in each cluster. Ignore the "iteration history" and "initial cluster centres" tables.

If you want to see how the algorithm works, have a look at: http://www.engr.sjsu.edu/~knapp/HCIRDFSC/C/k_means.htm
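For a rough idea of what happens inside, here is a minimal Python sketch of the standard K-means procedure: assign each sample to its nearest centre, move each centre to the mean of its members, and repeat until nothing changes. It is an illustration only, not the SPSS implementation. In particular, picking the initial centres at random from the samples is an assumption of this sketch, and the six two-species samples are invented.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two samples (equal-length lists)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(samples, k, max_iter=100, seed=0):
    """Cluster samples into k groups; returns (labels, centres)."""
    rng = random.Random(seed)
    # Assumption of this sketch: start from k randomly chosen samples.
    centres = [list(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: euclidean(s, centres[j]))
            for s in samples
        ]
        if new_labels == labels:
            break  # no sample changed cluster, so the solution is stable
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its old centre
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical abundances of two species in six samples.
samples = [[9, 1], [8, 2], [10, 0], [1, 9], [0, 8], [2, 10]]
labels, centres = k_means(samples, k=2)
print(labels)
print(centres)
```

The returned centres play the same role as the "final cluster centres" table: each is the mean abundance of every species across the samples in that cluster.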

Hierarchical Cluster Analysis

There is a huge range of hierarchical cluster analysis methods available, and they give different results depending upon which you choose. The two basic choices that need making are how you assess the similarity between samples, and how you combine the samples into clusters. As with PCA, you have a choice on whether to standardise the data to give all species equal weights. The analysis outlined here uses a distance method that measures similarity between samples in a way that is consistent with the way that PCA treats distances. It also uses the most pessimistic clustering method, which will only identify nice clean clusters if these really exist in the data.

Select Analyze>Classify>Hierarchical Cluster. Highlight the names of all your species and move them over into the variables box. Click on the Method button and ensure that Method is set to Nearest Neighbour, Measure is set to Euclidean distance and Standardise is set to None. Click on the Plots button, tick the dendrogram box and choose the None button in the icicle plot box. Then run the analysis. Look at the dendrogram and see if there are any nice clear clusters visible. If there are not, repeat the analysis using between groups linkage as the method. In either case, try to identify a reasonable number of clusters into which to split your data, then repeat the cluster analysis, this time clicking on the Save button and requesting that cluster memberships are saved as a new variable. Then draw the scatter plot using factors 1 and 2 from the PCA, plotting different symbols for the different clusters. What is the relationship between the clusters and the PCA?
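To make the Nearest Neighbour method more concrete, here is a minimal Python sketch of agglomerative clustering with that linkage rule and Euclidean distance. Every sample starts as its own cluster, and the two closest clusters are merged repeatedly until the requested number remains. It is an illustration of the general technique and not the SPSS code; the function name `single_linkage` and the six example samples are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two samples (equal-length lists)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(samples, n_clusters):
    """Merge samples into n_clusters groups by nearest-neighbour linkage.

    Returns a list of clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Nearest-neighbour rule: the distance between two clusters
                # is the distance between their two closest members.
                d = min(
                    euclidean(samples[i], samples[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Hypothetical abundances of two species in six samples.
samples = [[9, 1], [8, 2], [10, 0], [1, 9], [0, 8], [2, 10]]
groups = single_linkage(samples, n_clusters=2)
print(groups)
```

Replacing the `min` over pairs of members with the mean over all pairs would give average linkage between groups, which is what between groups linkage is usually taken to mean. The order in which the merges happen is what the dendrogram displays.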

Experiment with clustering methods and distance measures until you get a manageable number of clusters. Then calculate the mean abundance of each species in each cluster, and try to characterise them in terms of their dominant species.