Hierarchical clustering for centers of kmeans in R

268 views Asked by At

I have a huge data set (200,000 rows * 40 columns) where each row represents an observation and each column is a variable. For this data, I would like to do hierarchical clustering. Unfortunately, as the number of rows is huge, then it is impossible to do this using my computer since I need to compute the distance matrix for all pairs of observations so (200,000 * 200,000) matrix.

The answer of this question suggests to use first kmeans to calculate a number of centers, then to perform the hierarchical clustering on the coordinates of these centers using the library FactoMineR.

The problem: I keep getting an error when applying the same method!

#example

# Data
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

kClust_MyData <- kmeans(MyData, 1000, iter.max=20)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph=FALSE, nb.clust=-1)
plot.HCPC(Hclust_MyData, choice="tree")

But

Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) : 
  object 'data.clust' not found
1

There are 1 answers

0
cdalitz On

The package fastcluster has a method hclust.vector that does not require a distance matrix as input, but computes the distances itself in a more memory efficient way. From the fastcluster manual:

The call
hclust.vector(X, method='single', metric=[...])
is equivalent to
hclust(dist(X, metric=[...]), method='single')
but uses less memory and is equally fast