I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster.
I read that you can use the --namedVector option when creating the sparse vectors, but where does it take the ID from, and how can I retrieve this ID after the clustering is completed?
Right now I am doing the following steps:
I have a directory with a file for each document. The files are in the following format with the ID of the document as filename:
filename: documentID.txt
[TITLE]
[CONTENT]
I create a sparse directory with namedVectors using:
./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector
Then I can cluster the results and create a dump:
./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints
The dump looks like this:
:VL-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
Top Terms:
epa => 13.471728324890137
mountaintop => 11.364262580871582
mine => 10.942587852478027
Weight : [props - optional]: Point:
[...]
I found a way. You can run seqdumper on the clusteredPoints directory to extract the cluster mapping.
Then you can use a regex to extract the mapping of the vector IDs to cluster IDs.
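The regex step could look roughly like the sketch below. It assumes the seqdumper output was written to a text file, for example with something like `./mahout seqdumper -i tmp/es-kmeans/clusteredPoints/part-m-00000 -o tmp/clusteredPoints.txt` (the part file name and the output path are assumptions, check what your run produced). It also assumes each record line looks like `Key: <clusterID>: Value: ... <vectorName> = [...]`, which is what I would expect from Mahout 0.x but may differ in your version. The function name `parse_cluster_mapping` is made up for this example.

```python
import re
from collections import defaultdict

# Assumed record layout (may differ between Mahout versions):
#   Key: 190: Value: wt: 1.0 distance: 25.3  vec: /12345.txt = [1:3.407, ...]
# Group 1 is the cluster ID, group 2 is the vector name that --namedVector kept.
LINE_RE = re.compile(r"^Key: (\d+): Value: .*?(\S+) = \[")


def parse_cluster_mapping(lines):
    """Return {cluster_id: [vector_name, ...]} from seqdumper output lines."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # header and footer lines such as "Count: 3" are skipped
            clusters[int(match.group(1))].append(match.group(2))
    return dict(clusters)
```

To use it, open the dump file and pass the file object in, e.g. `parse_cluster_mapping(open("tmp/clusteredPoints.txt"))`. With the directory layout from the question, the vector names should correspond to the file names (documentID.txt), possibly with a leading slash.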