I am trying to learn more about agglomerative clustering from resources online by coming up with an algorithm rather than borrowing scipy routines; for my example, I am using the single-link criteria, which merges the two closest points together iteratively.

Using my algorithm, I have obtained the necessary data to form a dendrogram. Drawing one using matplotlib seems overly complicated, so I thought I'd borrow the scipy routine dendrogram just for plotting. In order to do so, I must input my result in such a way that the scipy routine understands. Below is a comparison of my algorithm and the scipy algorithm. I do not understand how their output is obtained. I do not think I've made a mistake, but please correct me if I am wrong.

## GET EXAMPLE DATA
x = np.array([0, 5, 25, 125, 625])
y = np.array([0, 4, 16, 64, 256])
z = np.array([1, 1, 1, 5, 1000])
# data = np.array([x, y])
data = np.array([x, y, z])

def get_all_possible_differences_per_dim(data, dim):
    """ """
    ## ADD DIMENSION AND BROADCAST SUBTRACTION
    x = np.reshape(data[dim], (len(data[dim]), 1))
    return x - x.transpose()

def get_euclidean_distance(points):
    """ """
    ## STANDARD DISTANCE METRIX
    return np.sqrt(np.sum(np.square(points), axis=0))

def single_link(data):
    """ """
    ## GET DISTANCE MATRIX
    ndim = data.shape[0]
    all_distances = np.array([get_all_possible_differences_per_dim(data, dim) for dim in range(ndim)])
    distance_matrix = get_euclidean_distance(all_distances)
    ## PAD DISTANCE MATRIX AT/BELOW DIAGONAL WITH LARGE DIFFERENCES
    ## GET ARGSORTED INDICES OF CLOSEST DATA POINTS USING NUMBER OF ELEMENTS CONTAINED IN ABOVE-DIAGONAL DISTANCES
    ## BELOW-DIAGONAL ENTRIES[i, j] = (-1) * ABOVE-DIAGONAL ENTRIES[j, i]; AVOID OVER-COUNTING
    for row in range(len(distance_matrix)):
        for col in range(len(distance_matrix[row])):
            if row >= col:
                distance_matrix[row, col] = np.inf
    res = {}
    ## FLATTEN COPY OF DISTANCE MATRIX TO ONE DIMENSION
    arr = np.reshape(distance_matrix, -1)
    n = 2*distance_matrix.shape[0]
    ## USE ARGSORTED INDEX IN CONJUNCTION WITH NUMBER OF ROWS AND COLUMNS
    ## LOCATE i-th DATAPOINT (via ROW) WITH THE j-th DATA POINT (Via COLUMN)
    rows, cols = np.divmod(np.argsort(arr)[:n], distance_matrix.shape[0])
    ## GET INDICES
    indices = np.array([rows, cols]).T
    res['indices'] = indices
    ## GET DISTANCES AT INDICES
    res['distances'] = distance_matrix[indices[:, 0], indices[:, 1]]
    res['distance matrix'] = distance_matrix
    return res

To view the results obtained from the algorithm above,

my_result = single_link(data)
for key, value in my_result.items():
    print("\n .. {} ({}):\n{}\n".format(key, value.shape, value))

produces the output:

 .. distances ((10,)):
[   6.40312424   23.32380758   29.68164416  110.9954954   134.22369388
  140.48843369 1129.99513273 1189.79031766 1202.45789947 1205.88639598]


 .. indices ((10, 2)):
[[0 1]
 [1 2]
 [0 2]
 [2 3]
 [1 3]
 [0 3]
 [3 4]
 [2 4]
 [1 4]
 [0 4]]


 .. distance matrix ((5, 5)):
[[          inf    6.40312424   29.68164416  140.48843369 1205.88639598]
 [          inf           inf   23.32380758  134.22369388 1202.45789947]
 [          inf           inf           inf  110.9954954  1189.79031766]
 [          inf           inf           inf           inf 1129.99513273]
 [          inf           inf           inf           inf           inf]]

From the above, we identify the first pair of indices as [0, 1], which corresponds to the 0-th row and 1-st column of the distance matrix. The i-th pair of indices correspond to the i-th shortest distances away.

To view the result from the scipy algorithm, run this:

scipy_result = linkage(data, method='single')
print("\n .. SCIPY RESULT ({}):\n{}\n".format(scipy_result.shape, scipy_result))

which produces:

 .. SCIPY RESULT ((2, 4)):
[[  0.           1.         374.11762856   2.        ]
 [  2.           3.         394.48447371   3.        ]]

To view the figure obtained via scipy,

fig = plt.figure()
dn = dendrogram(scipy_result)
plt.show()
plt.close(fig)  

dendrogram figure

I can see that the scipy array and my array my_result['distances'] have similar shape and likely correspond to each other. Clearly, they have reduced a dimension. But since I am argsorting, I should be collecting the same results in a different shape. Can someone help me see what I am doing wrong?

  [1]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage

0 Answers