I'm doing some kmeans clustering:
Regardless of how many clusters I choose, the percentage of point variability reported on the plot does not change.
Here's how I am plotting my data:
# Prepare Data
mydata <- read.csv("~/student-mat.csv", sep=";")
# Let's only grab the numeric columns
mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","famrel",
                    "freetime","goout","Dalc","Walc","health","absences","G1","G2","G3")]
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
library(ggplot2)
# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5) #to change number of clusters, I change the "5"
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels=0, lines=0)
How can I affect the percentage of point variability?
The percentage of variability reported on the plot is the amount of variance explained by the first two principal components, which clusplot computes purely to project your data into two dimensions for display. That number depends only on the data itself; it has nothing to do with the clustering algorithm you use (kmeans here), its accuracy, or the number of clusters you request, which is why it never changes.
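As a sanity check, you can compute that percentage yourself from a plain PCA of the same scaled matrix. This sketch reuses mydata from the question's code; prcomp is base R:

pca <- prcomp(mydata)                   # mydata is already scaled above
sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)  # proportion of variance in PC1 + PC2
# equivalently: summary(pca)$importance["Cumulative Proportion", 2]
# Re-running kmeans(mydata, k) with any k leaves this number unchanged.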
To get a sense of how accurate your clustering is, at the very least you can use table() to construct a cross-classification table between the cluster assignments and your observed data, typically some reference labels you held out of the clustering process. From that cross-tabulation (confusion matrix) you can calculate metrics like User's and Producer's accuracy. There are far more sophisticated approaches, of course, but hopefully this gets you started thinking about the best way to assess your classification accuracy.
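For example, assuming you kept aside a hypothetical vector of reference labels called known_group (not part of the code above), a minimal sketch might look like:

confusion <- table(cluster = fit$cluster, reference = known_group)
confusion
# Taking the majority reference class in each cluster as that cluster's label,
# a rough analogue of User's / Producer's accuracy is:
users_acc     <- apply(confusion, 1, max) / rowSums(confusion)  # by cluster (rows)
producers_acc <- apply(confusion, 2, max) / colSums(confusion)  # by reference class (columns)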