How to explain a higher percentage of point variability using kmeans clustering?

I'm doing some kmeans clustering:

enter image description here

Regardless of how many clusters I choose to use, the percentage of point variability does not change:

enter image description here

Here's how I am plotting my data:

# Prepare Data
mydata <- read.csv("~/student-mat.csv", sep=";")

# Let's only grab the numeric columns
mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam

mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
library(ggplot2)

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5) #to change number of clusters, I change the "5"

# Cluster Plot against 1st 2 principal components

# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
   labels=0, lines=0)

How do we affect the percentage of point variability?

There are 2 answers

Forrest R. Stevens (BEST ANSWER)

The percentage of variance explained comes from the two principal components that are calculated to visualize your data. It has nothing to do with the type of clustering algorithm you're using (kmeans in this case) or with how accurate it is, so changing the number of clusters will not change it.
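To make that concrete, here is a minimal sketch that computes the share of variance captured by the first two principal components directly, without any clustering. It uses the built-in iris data as a stand-in (an assumption, since student-mat.csv isn't available here), and it assumes the figure in the plot caption comes from a PCA of the scaled data, which is my understanding of what clusplot does internally.

```r
# Stand-in data (assumption): the four numeric columns of the built-in
# iris set, standardized the same way as in the question.
x <- scale(iris[, 1:4])

# PCA on the standardized data; kmeans is not involved at all.
pca <- prcomp(x)
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# Percentage of total variance captured by the first two components.
pct_two_pcs <- 100 * sum(var_share[1:2])
print(round(pct_two_pcs, 2))

# Changing k only changes the cluster labels drawn on top of the plot.
# The PCA above, and therefore the percentage, stays the same.
set.seed(1)
fit3 <- kmeans(x, 3)
fit5 <- kmeans(x, 5)
```

The only way to move that percentage is to change the data going into the PCA (for example, which columns you keep or how you scale them), not the number of clusters.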

To understand how accurate your clustering is, at the very least you can use table() to construct a cross-classification table of the cluster assignments against your observed data, typically some data you've held out of the clustering process. From that cross-tabulation (confusion matrix) you can then calculate metrics like User's/Producer's accuracy. There are far more sophisticated approaches, of course, but hopefully that gets you started thinking about the best way to assess your classification accuracy.
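A rough sketch of the table() idea, again on the built-in iris data as a stand-in (an assumption). Here iris$Species plays the role of the held-out reference labels; your own data would need some comparable column that was kept out of the clustering. The agreement figure at the end is a simple cluster-purity measure that I've added for illustration, not User's/Producer's accuracy.

```r
# Stand-in data (assumption): cluster on the numeric columns only and
# keep Species out of the clustering to serve as a reference label.
x <- scale(iris[, 1:4])

set.seed(42)
fit <- kmeans(x, centers = 3, nstart = 25)

# Cross-classification table: cluster assignment vs. reference label.
xtab <- table(cluster = fit$cluster, species = iris$Species)
print(xtab)

# Rough agreement (purity): count each cluster's most common label as
# a match, then divide by the number of observations.
agreement <- sum(apply(xtab, 1, max)) / sum(xtab)
print(agreement)
```

Cluster numbers are arbitrary, so read the table row by row rather than expecting matches on the diagonal.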

Has QUIT--Anony-Mousse

k-means is not "explaining" variance.

The number refers to the visualization that clusplot automagically does for you. So you've been misled by too much automation.

Judging from the plot, I'd say the data doesn't cluster with k-means.