I have a huge data set (200,000 rows * 40 columns) where each row represents an observation and each column is a variable. For this data, I would like to do hierarchical clustering. Unfortunately, because the number of rows is so large, this is impossible on my computer: I would need to compute the distance matrix for all pairs of observations, i.e. a 200,000 * 200,000 matrix.
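To put a number on that, here is a rough back-of-the-envelope estimate. It assumes the distances are stored as 8-byte doubles and that only the lower triangle is kept, as base R's dist() does; the variable names are just for illustration.

```r
# Rough memory estimate for a pairwise distance matrix.
n <- 200000                      # number of observations (from the question)
n_pairs <- n * (n - 1) / 2       # dist() keeps only the lower triangle
bytes_per_value <- 8             # assumption: one double per distance
gib <- n_pairs * bytes_per_value / 1024^3
print(round(gib, 1))             # size in GiB
```

That comes to roughly 149 GiB (about 160 GB) for the distances alone, before any clustering work is done.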
The answer to this question suggests first using kmeans to calculate a number of centers, and then performing the hierarchical clustering on the coordinates of these centers using the library FactoMineR.
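The two-step idea can be sketched with base R only. Note that this swaps in stats::hclust with Ward linkage in place of FactoMineR's HCPC, and uses a small made-up data set (toy) with 50 centers and 2 final groups; all of those choices are assumptions for illustration, not part of the linked answer.

```r
set.seed(1)  # reproducible toy data (hypothetical, not the real data set)
toy <- rbind(matrix(rnorm(2000, sd = 0.3), ncol = 2),
             matrix(rnorm(2000, mean = 1, sd = 0.3), ncol = 2))

# Step 1: compress the observations into a modest number of k-means centers.
km <- kmeans(toy, centers = 50, iter.max = 50, nstart = 5)

# Step 2: hierarchical clustering on the centers only (50 x 50 distances).
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the tree, then map each original observation to its center's group.
center_groups <- cutree(hc, k = 2)
obs_groups <- center_groups[km$cluster]
table(obs_groups)
```

One caveat of this sketch: every center is treated equally in step 2, even though the centers usually summarise different numbers of observations.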
The problem: I keep getting an error when applying the same method!
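# Example
library(FactoMineR)
# Data: two Gaussian clouds of 35,000 points each, in 2 dimensions
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")
# Reduce the data to 1000 k-means centers
kClust_MyData <- kmeans(MyData, 1000, iter.max = 20)
# Hierarchical clustering on the centers (this is the call that fails)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(Hclust_MyData, choice = "tree")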
But I get:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) :
object 'data.clust' not found
The package fastcluster has a function hclust.vector that does not require a distance matrix as input; instead, it computes the distances itself in a more memory-efficient way. From the fastcluster manual: