Choosing Cluster Levels

3.3.4 Choosing Cluster Levels

1. The Generate Cluster Column(s) dialog that now appears lists the ten best clustering levels for this dendrogram; a slide bar lets you view more if you need to.

Each number of clusters is characterized by the relative distance to the next level of clustering. The average distance is 1. The distance to the next level is large if relatively dissimilar compounds must be lumped together to reduce the number of clusters by one at that level; when that is so, the level in question is "natural". In this case, the most natural clusterings are into sixteen and seventeen groups, which is not very useful. The next best clustering is at two clusters, which is at a relative distance of 1.931 from the next (three-cluster) level, but three clusters is almost as good at 1.453. An examination of the dendrogram in D1 shows that either two or three would be a reasonable clustering level to choose.
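The idea of a relative distance with an average of 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual calculation: it assumes the distance for a level is the gap between two consecutive merge heights in the dendrogram, divided by the mean gap. The function name relative_distances and the merge heights are hypothetical.

```python
def relative_distances(merge_heights):
    """Map each cluster count k to the relative distance to the (k - 1)-cluster level.

    Assumption: the distance for a level is the gap between consecutive
    merge heights, divided by the mean gap so that the average is 1.
    """
    heights = sorted(merge_heights)
    n_items = len(heights) + 1          # n items give n - 1 merges
    gaps = {}
    for i in range(1, len(heights)):
        k = n_items - i                 # clusters left after i merges
        gaps[k] = heights[i] - heights[i - 1]
    mean_gap = sum(gaps.values()) / len(gaps)
    if mean_gap == 0:
        raise ValueError("all merges occur at the same height")
    return {k: gap / mean_gap for k, gap in gaps.items()}


# Hypothetical merge heights for six compounds (not taken from the tutorial data)
rel = relative_distances([1.0, 1.5, 2.0, 4.0, 5.0])
ranked = sorted(rel, key=lambda k: -rel[k])   # most "natural" level first
print(ranked)
```

Under these assumptions, a level ranks highly when the next merge has to reach much further up the dendrogram than the previous one did.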

But we also need higher-resolution levels to get useful insights into the dataset. The best-looking levels at roughly 2-fold steps down from 3 are 7 and 11.

Click on those levels containing 2, 3, 7, and 11 clusters.

Levels Selected

Set the Generate Cluster Column(s) dialog as follows:

Set the Disperse option menu to No Levels.

Make sure that Sort by Indices is turned on.

The hierarchical analysis name HB, which is provided by default, will be fine as Root name for new columns.

Press Add Columns.

Press End to close the Hierarchical Clustering Analysis dialog and re-enable the spreadsheet.

2. Four integer columns (X2_HB, X3_HB, X7_HB and X11_HB) have been appended to my_ryanoids. Had the Disperse option menu in the Generate Cluster Column(s) dialog been set to Lowest Level, X11_HB would have been EXPLICIT. The integer part of the X11_HB column would correspond to the cluster number, whereas the decimal part would be a random number drawn from the interval 0.1 to 0.9.
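The dispersed form of a cluster column described above can be imitated outside the program. The following Python sketch assumes only what the text states: the integer part is the cluster number and the decimal part is drawn at random from 0.1 to 0.9. The function name disperse, the uniform distribution, and the sample cluster numbers are assumptions made for illustration.

```python
import random


def disperse(cluster_ids, low=0.1, high=0.9, seed=None):
    """Add a random decimal part in [low, high] to each integer cluster number.

    Assumption: the decimal part is drawn uniformly; the tutorial only
    says it is a random number from the interval 0.1 to 0.9.
    """
    rng = random.Random(seed)
    return [cluster + rng.uniform(low, high) for cluster in cluster_ids]


# Hypothetical cluster numbers for five rows of an eleven-cluster level
values = disperse([1, 1, 2, 11, 11], seed=0)
print([round(v, 3) for v in values])
```

Taking int() of a dispersed value recovers the original cluster number, so no cluster information is lost.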

Notice how the cluster levels are layered. Cluster 1 in X2 is split into clusters 1 and 2 in X3, whereas cluster 2 in X2 corresponds directly to cluster 3 in X3. Similarly, clusters 1, 2 and 3 in X11 make up cluster 1 in X7, whereas clusters 5 through 9 in X11 all come from cluster 2 in X3.
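The layering can be checked mechanically once the columns are exported. The Python sketch below assumes the two columns are available as plain lists with one entry per row; the function name parent_map and the five-row example are hypothetical, chosen to mirror the X2 and X3 relationship described above.

```python
def parent_map(fine, coarse):
    """Return {fine cluster: coarse cluster} for two cluster columns.

    Raises ValueError if any fine cluster has rows in more than one
    coarse cluster, i.e. if the two levels are not nested.
    """
    parents = {}
    for f, c in zip(fine, coarse):
        if parents.setdefault(f, c) != c:
            raise ValueError(f"fine cluster {f} spans coarse clusters {parents[f]} and {c}")
    return parents


# Hypothetical rows: cluster 1 of the coarse level splits into 1 and 2,
# and coarse cluster 2 maps directly onto fine cluster 3.
x2 = [1, 1, 1, 2, 2]
x3 = [1, 1, 2, 3, 3]
print(parent_map(x3, x2))
```

For columns cut from the same dendrogram the check should always pass; an error would suggest that the columns come from different analyses or that rows were reordered.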

Having the cluster numbers for related clusters be close to one another can be very useful, particularly in graphical analyses. One must keep in mind, however, that the ordering is only partial; there will always be at least one case where consecutive cluster numbers are assigned to unrelated clusters, e.g., 9 and 10 in X11 for this analysis.
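One rough way to spot such cases is to list the consecutive cluster numbers whose parents differ at a coarser level. The Python sketch below uses that as a proxy for "unrelated", which is an assumption; the function name boundary_pairs and the mapping are hypothetical and do not reproduce this analysis.

```python
def boundary_pairs(parent_of):
    """Return consecutive cluster numbers (k, k + 1) whose parents differ.

    parent_of maps each cluster number at a fine level to its cluster
    number at a coarser level.
    """
    numbers = sorted(parent_of)
    return [
        (a, b)
        for a, b in zip(numbers, numbers[1:])
        if b == a + 1 and parent_of[a] != parent_of[b]
    ]


# Hypothetical mapping from a three-cluster level to a two-cluster level
x3_to_x2 = {1: 1, 2: 1, 3: 2}
pairs = boundary_pairs(x3_to_x2)
print(pairs)
```

Pairs reported at the coarsest level are the ones most likely to mislead in a plot, since adjacent numbers there belong to branches that join only near the top of the dendrogram.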