A given dataset (a) is better clustered by our optimally ordered orthogonal neighbor-joining tree (O3NJ tree) (f) than by a dendrogram (d) produced by hierarchical clustering (HC) with complete linkage. Subfigure (b) shows the five nested clusters generated by cutting the dendrogram with a minimum-similarity bar; (c) shows four clusters automatically extracted with our method; (e) displays the NJ tree with random leaf order, while (f) shows the same tree with optimal leaf order (left) and the resulting cluster tree (right), as described in the paper.
We propose to use optimally ordered orthogonal neighbor-joining (O3NJ) trees as a new way to visually explore cluster structures and outliers in multi-dimensional data. Neighbor-joining (NJ) trees are widely used in biology, and their visual representation is similar to that of dendrograms. The core difference from dendrograms, however, is that NJ trees correctly encode distances between data points, resulting in trees with varying edge lengths. We optimize NJ trees for their use in visual analysis in two ways. First, we propose a novel leaf sorting algorithm that helps users better interpret adjacencies and proximities within such a tree. Second, we provide a new method to visually distill the cluster tree from an ordered NJ tree. Numerical evaluation and three case studies illustrate the benefits of this approach for exploring multi-dimensional data in areas such as biology and image analysis.
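The abstract's core claim, that dendrograms do not correctly encode pairwise distances while NJ trees do, can be checked numerically for the dendrogram side: the cophenetic correlation measures how well a dendrogram's ultrametric distances match the original ones (a value of 1 would mean a perfect encoding). A small illustrative sketch using SciPy, not taken from the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# synthetic data; any non-ultrametric point set will do
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))

d = pdist(X)                          # original pairwise distances
Z = linkage(d, method="complete")     # HC dendrogram (complete linkage)
c, coph_d = cophenet(Z, d)            # correlation and dendrogram distances

# c < 1 quantifies how much the dendrogram's merge heights
# deviate from the true pairwise distances
```

For generic data the correlation stays well below 1, which is the distortion the O3NJ representation is designed to avoid.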
Figure 2: Illustration of the NJ algorithm and comparison between an NJ tree and an HC tree. (a) input distance matrix; (b) construction of the NJ tree from the input distance matrix; (c) orthogonal NJ tree where the root is the duplication of U4; (d) HC tree generated by applying the AHC method with average linkage to the matrix in (a).
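For readers unfamiliar with the construction illustrated in (b), the classic neighbor-joining recurrence (Q-matrix, pair selection, branch-length and distance-matrix updates) can be sketched as below. The internal-node names `U1, U2, ...` mirror the figure's labels, but this is a textbook sketch, not the paper's implementation:

```python
import numpy as np

def neighbor_joining(D, labels):
    """Classic neighbor joining on a symmetric distance matrix D.

    Returns a list of join records (child_a, child_b, parent,
    branch_a, branch_b); the final record joins the last two nodes.
    """
    D = np.asarray(D, dtype=float)
    nodes = list(labels)
    joins, u = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)
        # Q[i, j] = (n - 2) * D[i, j] - r[i] - r[j]
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        # branch lengths from the joined pair to the new internal node
        li = 0.5 * D[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = D[i, j] - li
        u += 1
        parent = f"U{u}"
        joins.append((nodes[i], nodes[j], parent, li, lj))
        # distances from the new node to all remaining nodes
        d_new = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        D2 = np.empty((len(keep) + 1, len(keep) + 1))
        D2[:-1, :-1] = D[np.ix_(keep, keep)]
        D2[:-1, -1] = D2[-1, :-1] = d_new[keep]
        D2[-1, -1] = 0.0
        D = D2
        nodes = [nodes[k] for k in keep] + [parent]
    joins.append((nodes[0], nodes[1], None, D[0, 1], 0.0))
    return joins
```

On an additive distance matrix, e.g. four taxa with pairwise distances [[0,5,7,8],[5,0,8,9],[7,8,0,9],[8,9,9,0]], the recovered branch lengths reproduce the underlying tree exactly, which is the distance-preservation property the paper builds on.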
Figure 3: Comparison between ordinary hierarchical clustering (HC) and NJ trees generated from the same distance matrix as shown in (a). (b) Scatter plot with three clusters produced by multi-dimensional scaling; (c) HC tree where positions do not accurately reflect distances, e.g., the distance between I and A is 29 and the distance between I and B is 45, yet the horizontal path length from their least common ancestor (the root node) to the nodes I, A, and B is the same. Moreover, cutting the HC tree into three branches (black line) results in incorrect groupings for I and D. NJ trees accurately encode the distance matrix, but so far no good ordering algorithm exists: (d) NJ tree displayed with a random order, as currently done in an R package; (e) NJ tree ordered with the OLO algorithm; (f) NJ tree ordered with our algorithm, which yields three meaningful clusters.
Figure 4: Overview of our leaf ordering algorithm: (a) an orthogonal NJ tree with unequal edge lengths where each node is assigned a weight based on its edge length; (b) performing monotonicity-based flipping within a breadth-first traversal reduces the number of inversions; (c) extraction of the ordered sub-trees with a depth-first traversal; (d) the optimal leaf ordering is obtained by applying the DP algorithm to the ordered sub-trees.
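The DP step in (d) is related to the classical optimal-leaf-ordering (OLO) problem that Figure 3(e) references. For HC dendrograms (not NJ trees), SciPy ships an off-the-shelf OLO implementation, shown here only as a point of reference for what leaf reordering does; the paper's algorithm for NJ trees differs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist

# three well-separated synthetic clusters (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2)) for c in (0, 3, 6)])

d = pdist(X)
Z = optimal_leaf_ordering(linkage(d, method="average"), d)
order = leaves_list(Z)  # leaf permutation minimizing summed adjacent distances
```

OLO flips subtrees so that the sum of distances between adjacent leaves is minimal; the paper's contribution is making an analogous ordering tractable and meaningful for NJ trees with unequal edge lengths.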
Figure 5: Distilling the cluster tree from an O3NJ tree includes splitting and hierarchy extraction: (a) input O3NJ tree and a histogram of the ARD values; (b) the threshold selected from the histogram in (a) cuts the tree into three parts; (c) identified peaks and valleys, as well as the corresponding two-level cluster tree, for the middle branch; (d) distilled cluster tree (left) and nested clusters shown as a scatter plot (right).
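The peak-and-valley identification in (c) amounts to scanning a 1-D sequence of values for local extrema. A minimal, hypothetical stand-in for that scan (the paper's ARD computation itself is not reproduced here):

```python
def peaks_and_valleys(vals):
    """Return indices of strict local maxima (peaks) and strict local
    minima (valleys) in a 1-D sequence; endpoints are ignored."""
    peaks, valleys = [], []
    for i in range(1, len(vals) - 1):
        if vals[i] > vals[i - 1] and vals[i] > vals[i + 1]:
            peaks.append(i)
        elif vals[i] < vals[i - 1] and vals[i] < vals[i + 1]:
            valleys.append(i)
    return peaks, valleys
```

Valleys separate candidate clusters along the leaf order, while peaks mark the boundaries between them; the paper builds the two-level cluster tree from exactly this alternation.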
Figure 6: Persistence-inspired simplification of an O3NJ tree: the optimally ordered version of Fig. 1(e) is the input shown in (a); a default threshold of 10% of the maximal persistence value, shown in the persistence histogram in (b), removes most noisy peaks and valleys and results in the cluster hierarchy shown in (c).
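The thresholding idea can be illustrated with SciPy's peak prominence, a close relative of 1-D persistence: peaks whose prominence falls below a fraction of the maximum are treated as noise and discarded. This is a hedged analogue of the figure's rule, not the paper's exact simplification:

```python
import numpy as np
from scipy.signal import find_peaks

# hypothetical 1-D profile standing in for the ARD values along the leaf order
vals = np.array([0.1, 0.8, 0.3, 0.9, 0.85, 0.95, 0.2, 0.7, 0.1])

# all peaks with their prominences (prominence plays the role of persistence)
all_peaks, props = find_peaks(vals, prominence=0)

# keep only peaks whose prominence exceeds 10% of the maximum, mirroring
# the default threshold described in the caption
keep_thresh = 0.10 * props["prominences"].max()
peaks, _ = find_peaks(vals, prominence=keep_thresh)
```

Here the shallow peak at index 3 (prominence 0.05) is filtered out while the three pronounced peaks survive, which is the intended effect: small oscillations disappear, the dominant cluster structure remains.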
Figure 7: Values of NMI (a) and RI (b) for the four different clustering methods: the blue boxplots show the score distributions over all 52 datasets and the red boxplots show the distributions over the 33 real datasets.
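Both evaluation scores compare a predicted partition against ground-truth labels and are available off the shelf in scikit-learn; a minimal sketch with made-up labels:

```python
from sklearn.metrics import normalized_mutual_info_score, rand_score

# hypothetical ground-truth and predicted cluster assignments
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

nmi = normalized_mutual_info_score(labels_true, labels_pred)  # in [0, 1]
ri = rand_score(labels_true, labels_pred)                     # in [0, 1]
```

Both metrics reach 1 only for a perfect match (up to label permutation), which is why they are suitable for comparing clustering methods across datasets with differing numbers of clusters.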
Figure 8: (a) Clustering accuracy on the Isolet data for different methods; cluster trees produced by our method (b) and the AHC method (c).
Figure 9: Exploration of the Cucumber data: (a) O3NJ tree and distilled hierarchy; (b) MDS plot; (c) heat map showing the input distance matrix. Three kinds of inconsistencies between our O3NJ tree representation and the MDS results are highlighted by black, green, and blue circles in the O3NJ tree; the corresponding nodes are also highlighted in the MDS plot and the heat map. The class label and cluster index of each node are encoded by border and fill color, respectively.
Figure 10: Exploration of the CIFAR-10 image data with our distilled cluster tree: (a) clustering accuracy of several algorithms measured by the Rand Index; (b) four representative images of the cluster consisting of Horse and Deer classes; (c) sub-tree of the Car class shown in the O3NJ tree with images of one outlier and three selected nodes; (d) the distilled cluster tree, where each node is associated with a class label; (e) representative images of two clusters: Bird & Airplane and Car class; (f) representative images of two clusters: Bird & Airplane and Ship class.
This work was supported in part by the National Key R&D Program of China under Grant 2022ZD0160805, in part by NSFC under Grants 62132017 and 62141217, in part by the Shandong Provincial Natural Science Foundation under Grant ZQ2022JQ32, in part by the Shenzhen Science and Technology Program under Grant GJHZ20210705141402008, and in part by the DFG (German Research Foundation) under Germany's Excellence Strategy (EXC 2117, Grant 422037984) and under Project-ID 251654672 (TRR 161).