Simple KMeans model
Data
library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data = tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))
Model
kmeans_k2 <- kmeans(data, 2, 9)
Tidymodel Results
tidy(kmeans_k2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 9.60 10.8 18 255. 1
2 5.22 5.20 22 76.6 2
glance(kmeans_k2)
# A tibble: 1 x 4
totss tot.withinss betweenss iter
<dbl> <dbl> <dbl> <int>
1 836. 331. 505. 1
augment(kmeans_k2, data)
# A tibble: 40 x 3
x y .cluster
<dbl> <dbl> <fct>
1 6.37 5.21 2
2 4.44 4.64 2
3 5.36 5.76 2
4 5.63 4.27 2
5 5.40 3.63 2
6 4.89 5.43 2
7 6.51 4.19 2
8 4.91 6.44 2
9 7.02 4.57 2
10 4.94 5.66 2
# ... with 30 more rows
Silhouette Calculation
sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
summary(sil_k2)$si.summary[4]
Mean
0.5466503
Nested tibble using Purrr's map function to create results for multiple different k's
kmeans_k123 <- tibble(k = 1:3) %>%
mutate(km_model = map(k, ~kmeans(data, .x)),
tidydata = map(km_model, tidy),
glancedata = map(km_model, glance),
augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
k km_model tidydata glancedata augmentdata
<int> <list> <list> <list> <list>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3 3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k23, 3, 2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 5.63 5.10 26 117. 1
2 11.3 10.7 14 176. 2
Question is, How do I add the Silhouette score to the nested tibble? The Silhouette function needs the clusters of each model and I'm not sure how to do that. Obviously I can pluck out a single instance, such as
data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
But when I try and use that with map it doesn't work
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.
I can create a function, which again works for a single occurence
my_fn <- function(f_cluster, f_data){my_fn <- silhouette(f_cluster, dist(f_data))}
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
But fails when I use it with map.
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.
I suspect that the problem is related to how I am trying to retrieve $cluster from the nested models as I've tried extracting that to create it's own column, but can't get that to work either.
Putting it as an answer, because comments don't really allow for a ton of code.
The following works for me:
About the usage of
as.numeric(levels(.x$.cluster))[.x$.cluster]
, this is becausebroom::tidy()
turns the cluster variable into a factor, andcluster::silhouette()
requires the cluster variable to be numeric. This answer provides why you use that particular line of code to convert a factor of numbers into a numeric variable.