Hierarchical clustering of text, at scale

Asked by Alex Woolford on 25 June 2015 at 17:42

I have a large dataset (billions of records) that consists almost entirely of categorical variables. This data will be used to predict a fairly rare numerical outcome. Most of the attributes have high cardinality: often only a handful of samples fall into each category.

I read an interesting paper in which the authors use hierarchical clustering in R to reduce the dimensionality for a similar task. The following code worked on small datasets, but rapidly consumed all available memory when scaled up:

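library(cluster)  # provides daisy()
numClusters <- 42
# daisy() computes all pairwise dissimilarities (Gower when columns are factors)
hc <- hclust(daisy(df))
cutree(hc, k = numClusters)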
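
A rough back-of-the-envelope sketch of why this runs out of memory: the dissimilarity object passed to `hclust()` holds one value per unordered pair of rows, so it grows quadratically with the number of records. The helper name `dissimilarity_gib` is made up for illustration, and the estimate assumes one 8-byte double per pair and ignores any working memory `hclust()` needs on top.

```r
# Hypothetical helper: approximate size of a pairwise dissimilarity
# object for n rows, assuming one 8-byte double per unordered pair.
dissimilarity_gib <- function(n) {
  n_pairs <- n * (n - 1) / 2   # number of unordered row pairs
  n_pairs * 8 / 2^30           # bytes -> GiB
}

# Estimated footprint for 10 thousand, 100 thousand and 1 million rows.
sizes <- sapply(c(1e4, 1e5, 1e6), dissimilarity_gib)
print(round(sizes, 1))
```

At 100,000 rows the estimate is already in the tens of GiB, so billions of records are far out of reach for any approach that materialises the full pairwise matrix.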

I considered using Mahout's k-means implementation, but this wouldn't work for categorical variables, since it expects a vector of doubles.

Does anyone in the StackOverflow community have any thoughts/advice on how to perform hierarchical clustering on categorical variables at scale?

**knb** · Accepted Answer · 2015-06-27T10:32:10+00:00

I don't know how to answer "how to scale hclust up?" for your dataset. Throw more hardware/RAM at the problem, and/or search for a clever distributed implementation (Spark MLlib 1.4 does not implement hierarchical clustering, though).

Your question is slightly confusing; read on for why I think so.

I don't understand how hierarchical clustering will, in the end, help you with predicting the numerical class attribute.

If you definitely need to cluster on categorical attributes, look at an implementation of the EM (expectation-maximization) clustering algorithm, e.g. the one available through the R package RWeka. By default, that implementation of EM determines the best number of clusters itself (maybe at a local minimum?), and it gives you probabilities for each categorical value in a hierarchy, plus prior probabilities for each cluster.

You can use other clustering algorithms or packages.

If you already have a training set with class attributes, you might as well try the RWeka::J48() classifier to build a decision tree and make some predictions. You'll get something hierarchical and, after tuning, with as many levels as you want.

If you don't want to use RWeka, consider rpart::rpart() from the rpart package for a decision tree.
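
As a minimal sketch of the rpart suggestion, the example below fits a regression tree to categorical predictors and a numeric outcome. The data frame, its column names (`colour`, `region`, `y`) and the control values are invented for illustration; they are not the asker's data. It runs in memory on a small sample and does not by itself address the billions-of-records scale.

```r
library(rpart)

set.seed(1)
n <- 2000
# Synthetic stand-in data: two categorical predictors, one numeric outcome.
df <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), n, replace = TRUE)),
  region = factor(sample(c("north", "south", "east", "west"), n, replace = TRUE))
)
# The outcome is elevated only for one combination of categories.
df$y <- ifelse(df$colour == "red" & df$region == "north", 5, 0) +
  rnorm(n, sd = 0.5)

# method = "anova" fits a regression tree for a numeric response;
# maxdepth caps how many levels the tree may have.
fit <- rpart(y ~ colour + region, data = df, method = "anova",
             control = rpart.control(maxdepth = 3, cp = 0.01))

print(fit)                          # the splits form the hierarchy
leaf <- fit$where                   # leaf (row of fit$frame) each training row falls into
pred <- predict(fit, newdata = df)  # predicted outcome per row
```

The `maxdepth` and `cp` settings are the tuning knobs that control how many levels the tree ends up with, and `fit$where` is one way to read the fitted tree as a grouping of the category combinations.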
