I have a large dataset (billions of records) that almost entirely consists of categorical variables. This data will be used to predict a fairly rare numerical outcome. Most of the attributes have high cardinality: often only a handful of samples fall into any given category.
I read an interesting paper where they used hierarchical clustering in R to reduce the dimensionality for a similar task. The following code worked on small datasets, but rapidly consumed all available memory when scaled up:
library(cluster)
numClusters <- 42
# daisy() builds the full pairwise dissimilarity matrix (Gower dissimilarity
# handles the categorical columns), which is what exhausts memory at scale
hc <- hclust(daisy(df))
cutree(hc, k = numClusters)
I considered using Mahout's kmeans implementation, but this wouldn't work for categorical variables since it expects a vector of doubles.
Does anyone in the StackOverflow community have any thoughts/advice on how to perform hierarchical clustering on categorical variables at scale?
I don't have a good answer to "how do I scale hclust up?" for a dataset that size: hclust needs the full pairwise dissimilarity matrix in memory. You can throw more hardware/RAM at the problem and/or look for a clever distributed implementation (Spark MLlib 1.4 does not implement hierarchical clustering, though).
Your question is slightly confusing; read on to see why I think so.
I don't understand how hierarchical clustering will, in the end, help you with predicting the numerical class attribute.
If you definitely need to cluster on categorical attributes, look at an EM clusterer, e.g. the one available through the R package RWeka. That implementation of EM determines the number of clusters itself by default (possibly settling on a local optimum), and it gives you probabilities for each categorical value within each cluster, plus prior probabilities for each cluster.
You can use other clustering algorithms or packages.
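For illustration, a minimal sketch of the RWeka/EM route (assuming a data frame df_cat whose columns are all factors; the clusterer interface is created with make_Weka_clusterer, and N = -1 asks Weka's EM to pick the number of clusters by cross-validation):

library(RWeka)
# weka.clusterers.EM ships with Weka core; expose it through RWeka's
# generic interface generator
EM <- make_Weka_clusterer("weka/clusterers/EM")
# df_cat is a hypothetical data frame of factor (categorical) columns
fit <- EM(df_cat, control = Weka_control(N = -1))
fit                    # prints per-cluster priors and value distributions
predict(fit, df_cat)   # hard cluster assignment for each row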
If you already have a training set with class attributes, you might as well try the RWeka::J48() classifier to build a decision tree and make some predictions. You'll get something hierarchical and, after tuning, with as many levels as you want.
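A rough sketch of that, assuming the rare numeric outcome has already been binned into a factor called outcome (J48 only accepts a nominal class) and the categorical predictors live in a data frame df:

library(RWeka)
# M = 50 raises the minimum leaf size, giving a shallower tree; tune to taste
tree <- J48(outcome ~ ., data = df, control = Weka_control(M = 50))
tree                        # the printed splits give you the hierarchy
predict(tree, newdata = df)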
If you don't want to use RWeka, rpart::rpart() can build a decision tree as well.
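A corresponding rpart sketch; unlike J48 it handles the numeric outcome directly as a regression tree (cp and maxdepth below are illustrative, not tuned, values):

library(rpart)
fit <- rpart(outcome ~ ., data = df, method = "anova",
             control = rpart.control(cp = 0.001, maxdepth = 10))
printcp(fit)                # complexity table, useful for pruning
predict(fit, newdata = df)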