How to explain a higher percentage of point variability using kmeans clustering?

Question

2.1k views Asked by Alex Gordon At 14 June 2015 at 16:29

I'm doing some kmeans clustering:

enter image description here

Regardless of how many clusters I choose to use, the percentage of point variability does not change:

enter image description here

Here's how I am plotting my data:

# Prepare Data
mydata <- read.csv("~/student-mat.csv", sep=";")

# Let's only grab the numeric columns
mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam

mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
library(ggplot2)

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5) #to change number of clusters, I change the "5"

# Cluster Plot against 1st 2 principal components

# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
   labels=0, lines=0)

How do we affect the percentage of point variability?


There are 2 answers

Has QUIT--Anony-Mousse On 14 June 2015 at 20:02

k-means is not "explaining" variance.

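The number refers to the visualization that clusplot automagically does for you: it projects the data onto the first two principal components and reports how much of the point variability those two components capture. So you've been misled by too much automation.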

Judging from the plot, I'd say the data doesn't cluster with k-means.
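One way to check that impression with a number rather than a plot is the average silhouette width. The following is only a sketch: it assumes the built-in iris measurements as a stand-in for the student data (which is not available here), and the range of k and the seed are arbitrary choices. Values near 1 suggest well-separated clusters, while values near 0 suggest the points sit between clusters.

```r
library(cluster)

# Assumption: iris stands in for the asker's student-mat.csv data.
x <- scale(iris[, 1:4])
d <- dist(x)  # Euclidean distances, matching what k-means optimizes

set.seed(42)  # arbitrary seed so the random starts are repeatable
ks <- 2:6
avg_sil <- numeric(length(ks))

for (i in seq_along(ks)) {
  fit <- kmeans(x, centers = ks[i], nstart = 10)
  sil <- silhouette(fit$cluster, d)
  avg_sil[i] <- mean(sil[, "sil_width"])
  cat(sprintf("k = %d: mean silhouette width = %.3f\n", ks[i], avg_sil[i]))
}
```

If the mean silhouette width stays low for every k, that supports the reading that k-means does not find much structure in the data.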

**Forrest R. Stevens** · Accepted Answer · 2015-06-14T17:03:44+00:00

The amount of variance explained is related to the two principal components calculated to visualize your data. This has nothing to do with the type of clustering algorithm or the accuracy of the algorithm that you're using (kmeans in this case).
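To make this concrete, here is a minimal sketch that computes the share of variance captured by the first two principal components directly. It assumes the built-in iris measurements in place of the student data, and it uses prcomp on the scaled data, which should closely match what clusplot reports but is not guaranteed to be the identical computation. For contrast, it also prints the between_SS / total_SS ratio that kmeans itself reports, which does depend on the number of clusters.

```r
# Assumption: iris stands in for the asker's student-mat.csv data.
x <- scale(iris[, 1:4])

# Share of total variance captured by the first two principal components.
# This is computed from the data alone; no clustering is involved.
pca <- prcomp(x)
var_share <- pca$sdev^2 / sum(pca$sdev^2)
pct_two_pcs <- 100 * sum(var_share[1:2])

set.seed(42)  # arbitrary seed so the random starts are repeatable
for (k in 2:6) {
  fit <- kmeans(x, centers = k, nstart = 10)
  # The PCA percentage is the same for every k; only the k-means ratio moves.
  cat(sprintf("k = %d: first two PCs = %.2f%%, between_SS/total_SS = %.2f%%\n",
              k, pct_two_pcs, 100 * fit$betweenss / fit$totss))
}
```

The only way to change the percentage shown on the plot is to change the data being projected (for example, which columns are included), not the clustering.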

To understand how accurate your clustering is, at the very least you can use table() to construct a cross-classification table of the cluster assignments against your observed data, typically some data you've held out of the clustering process. Using that cross-tabulation (confusion matrix), you can then calculate metrics like user's and producer's accuracy. There are far more sophisticated approaches, of course, but hopefully that gets you started thinking about the best way to assess your classification accuracy.
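A rough sketch of that cross-tabulation follows. It assumes reference labels are available, which the question does not show; here the iris Species column plays that hypothetical role, and iris replaces the student data. Because cluster numbers are arbitrary, the sketch matches each cluster to its most common reference label and reports the overall agreement (often called purity), which is a simpler summary than per-class user's and producer's accuracy.

```r
# Assumption: iris stands in for the student data, and iris$Species plays the
# role of held-out reference labels that the clustering never saw.
x <- scale(iris[, 1:4])

set.seed(42)  # arbitrary seed so the random starts are repeatable
fit <- kmeans(x, centers = 3, nstart = 10)

# Cross-classification of cluster assignments against the reference labels.
confusion <- table(cluster = fit$cluster, reference = iris$Species)
print(confusion)

# Cluster ids carry no meaning, so credit each cluster with its majority label.
majority <- apply(confusion, 1, max)
purity <- sum(majority) / sum(confusion)
cat(sprintf("Overall agreement (purity): %.3f\n", purity))
```

Dividing the majority counts by the row sums instead gives a per-cluster agreement, which is closer in spirit to user's accuracy.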
