# Trying to better understand agglomerative hierarchical clustering by comparing output with scipy; why different ouputs?

I am trying to learn more about agglomerative clustering from resources online by coming up with an algorithm rather than borrowing scipy routines; for my example, I am using the single-link criteria, which merges the two closest points together iteratively.

Using my algorithm, I have obtained the necessary data to form a dendrogram. Drawing one using matplotlib seems overly complicated, so I thought I'd borrow the scipy routine `dendrogram` just for plotting. In order to do so, I must input my result in such a way that the scipy routine understands. Below is a comparison of my algorithm and the scipy algorithm. I do not understand how their output is obtained. I do not think I've made a mistake, but please correct me if I am wrong.

``````## GET EXAMPLE DATA
x = np.array([0, 5, 25, 125, 625])
y = np.array([0, 4, 16, 64, 256])
z = np.array([1, 1, 1, 5, 1000])
# data = np.array([x, y])
data = np.array([x, y, z])

def get_all_possible_differences_per_dim(data, dim):
""" """
x = np.reshape(data[dim], (len(data[dim]), 1))
return x - x.transpose()

def get_euclidean_distance(points):
""" """
## STANDARD DISTANCE METRIX
return np.sqrt(np.sum(np.square(points), axis=0))

""" """
## GET DISTANCE MATRIX
ndim = data.shape
all_distances = np.array([get_all_possible_differences_per_dim(data, dim) for dim in range(ndim)])
distance_matrix = get_euclidean_distance(all_distances)
## PAD DISTANCE MATRIX AT/BELOW DIAGONAL WITH LARGE DIFFERENCES
## GET ARGSORTED INDICES OF CLOSEST DATA POINTS USING NUMBER OF ELEMENTS CONTAINED IN ABOVE-DIAGONAL DISTANCES
## BELOW-DIAGONAL ENTRIES[i, j] = (-1) * ABOVE-DIAGONAL ENTRIES[j, i]; AVOID OVER-COUNTING
for row in range(len(distance_matrix)):
for col in range(len(distance_matrix[row])):
if row >= col:
distance_matrix[row, col] = np.inf
res = {}
## FLATTEN COPY OF DISTANCE MATRIX TO ONE DIMENSION
arr = np.reshape(distance_matrix, -1)
n = 2*distance_matrix.shape
## USE ARGSORTED INDEX IN CONJUNCTION WITH NUMBER OF ROWS AND COLUMNS
## LOCATE i-th DATAPOINT (via ROW) WITH THE j-th DATA POINT (Via COLUMN)
rows, cols = np.divmod(np.argsort(arr)[:n], distance_matrix.shape)
## GET INDICES
indices = np.array([rows, cols]).T
res['indices'] = indices
## GET DISTANCES AT INDICES
res['distances'] = distance_matrix[indices[:, 0], indices[:, 1]]
res['distance matrix'] = distance_matrix
return res
``````

To view the results obtained from the algorithm above,

``````my_result = single_link(data)
for key, value in my_result.items():
print("\n .. {} ({}):\n{}\n".format(key, value.shape, value))
``````

produces the output:

`````` .. distances ((10,)):
[   6.40312424   23.32380758   29.68164416  110.9954954   134.22369388
140.48843369 1129.99513273 1189.79031766 1202.45789947 1205.88639598]

.. indices ((10, 2)):
[[0 1]
[1 2]
[0 2]
[2 3]
[1 3]
[0 3]
[3 4]
[2 4]
[1 4]
[0 4]]

.. distance matrix ((5, 5)):
[[          inf    6.40312424   29.68164416  140.48843369 1205.88639598]
[          inf           inf   23.32380758  134.22369388 1202.45789947]
[          inf           inf           inf  110.9954954  1189.79031766]
[          inf           inf           inf           inf 1129.99513273]
[          inf           inf           inf           inf           inf]]
``````

From the above, we identify the first pair of `indices` as `[0, 1]`, which corresponds to the 0-th row and 1-st column of the distance matrix. The i-th pair of indices correspond to the i-th shortest distances away.

To view the result from the scipy algorithm, run this:

``````scipy_result = linkage(data, method='single')
print("\n .. SCIPY RESULT ({}):\n{}\n".format(scipy_result.shape, scipy_result))
``````

which produces:

`````` .. SCIPY RESULT ((2, 4)):
[[  0.           1.         374.11762856   2.        ]
[  2.           3.         394.48447371   3.        ]]
``````

To view the figure obtained via scipy,

``````fig = plt.figure()
dn = dendrogram(scipy_result)
plt.show()
plt.close(fig)
``````

I can see that the scipy array and my array `my_result['distances']` have similar shape and likely correspond to each other. Clearly, they have reduced a dimension. But since I am argsorting, I should be collecting the same results in a different shape. Can someone help me see what I am doing wrong?

``````  : https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage
``````