Cophenetic correlation
In statistics, and especially in biostatistics, cophenetic correlation (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. Although it has been most widely applied in the field of biostatistics (typically to assess cluster-based models of DNA sequences, or other taxonomic models), it can also be used in other fields of inquiry where raw data tend to occur in clumps, or clusters.[1]
Calculating the cophenetic correlation coefficient
Suppose that the original data {Xi} have been modeled using a cluster method to produce a dendrogram {Ti}; that is, a simplified model in which data that are "close" have been grouped into a hierarchical tree. Define the following distance measures:
- x(i, j) = | Xi − Xj |, the ordinary Euclidean distance between the ith and jth observations.
- t(i, j) = the dendrogrammatic distance between the model points Ti and Tj. This distance is the height of the node at which these two points are first joined together in the tree.
Then, letting x̄ be the average of the x(i, j), and letting t̄ be the average of the t(i, j), the cophenetic correlation coefficient c is given by[2]
- <math>
c = \frac {\sum_{i<j} (x(i,j) - \bar{x})(t(i,j) - \bar{t})}{\sqrt{[\sum_{i<j}(x(i,j)-\bar{x})^2] [\sum_{i<j}(t(i,j)-\bar{t})^2]}}. </math>
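The calculation can be illustrated with a minimal sketch in Python. It rests on several assumptions that are not part of the definition above: the dendrogram is built by single-linkage agglomerative clustering (any other linkage could be substituted), t(i, j) is taken to be the height at which observations i and j are first merged, and the function names `single_linkage_cophenetic` and `cophenetic_correlation` are hypothetical, chosen only for this example.

```python
from itertools import combinations
from math import dist, sqrt


def single_linkage_cophenetic(points):
    """Return {(i, j): height at which i and j are first merged}, with i < j.

    Assumption: clusters are merged by single linkage (nearest members).
    """
    clusters = [[i] for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        height, a, b = best
        # Every cross-cluster pair is joined for the first time at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return coph


def cophenetic_correlation(points):
    """Correlate Euclidean distances x(i, j) with dendrogram distances t(i, j)."""
    pairs = list(combinations(range(len(points)), 2))
    coph = single_linkage_cophenetic(points)
    x = [dist(points[i], points[j]) for i, j in pairs]
    t = [coph[p] for p in pairs]
    x_bar = sum(x) / len(x)
    t_bar = sum(t) / len(t)
    num = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((ti - t_bar) ** 2 for ti in t))
    return num / den


# Two well-separated pairs of points: the tree should preserve distances well.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
c = cophenetic_correlation(points)
print(c)
```

For data such as these, which fall into two tight and widely separated groups, the coefficient comes out close to 1, indicating that the dendrogram preserves the original pairwise distances well. The brute-force search for the closest clusters is adequate only for small examples.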
References
- [1] Dorthe B. Carr, Chris J. Young, Richard C. Aster, and Xioabing Zhang, Cluster Analysis for CTBT Seismic Event Monitoring (a study prepared for the U.S. Department of Energy)
- [2] MathWorks Statistics Toolbox