R Using Purrr Map function to calculate Silhouette distances of KMeans model

352 views Asked by At

Simple KMeans model

Data

library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data = tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
              y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))

Model

kmeans_k2 <- kmeans(data, 2, 9)

Tidymodel Results

tidy(kmeans_k2)
# A tibble: 2 x 5
      x     y  size withinss cluster
  <dbl> <dbl> <int>    <dbl> <fct>  
1  9.60 10.8     18    255.  1      
2  5.22  5.20    22     76.6 2      
glance(kmeans_k2)
# A tibble: 1 x 4
  totss tot.withinss betweenss  iter
  <dbl>        <dbl>     <dbl> <int>
1  836.         331.      505.     1
augment(kmeans_k2, data)
# A tibble: 40 x 3
       x     y .cluster
   <dbl> <dbl> <fct>   
 1  6.37  5.21 2       
 2  4.44  4.64 2       
 3  5.36  5.76 2       
 4  5.63  4.27 2       
 5  5.40  3.63 2       
 6  4.89  5.43 2       
 7  6.51  4.19 2       
 8  4.91  6.44 2       
 9  7.02  4.57 2       
10  4.94  5.66 2       
# ... with 30 more rows

Silhouette Calculation

sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 
summary(sil_k2)$si.summary[4]
     Mean 
0.5466503 

Nested tibble using Purrr's map function to create results for multiple different k's

kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
      k km_model tidydata         glancedata       augmentdata      
  <int> <list>   <list>           <list>           <list>           
1     1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3     3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k23, 3, 2)
# A tibble: 2 x 5
      x     y  size withinss cluster
  <dbl> <dbl> <int>    <dbl> <fct>  
1  5.63  5.10    26     117. 1      
2 11.3  10.7     14     176. 2      

Question is, How do I add the Silhouette score to the nested tibble? The Silhouette function needs the clusters of each model and I'm not sure how to do that. Obviously I can pluck out a single instance, such as

data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 

But when I try and use that with map it doesn't work

kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.

I can create a function, which again works for a single occurence

my_fn <- function(f_cluster, f_data){my_fn <- silhouette(f_cluster, dist(f_data))}
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
 Cluster sizes and average silhouette widths:
       18        22 
0.3527383 0.7053056 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04243  0.45422  0.58233  0.54665  0.74472  0.79591 

But fails when I use it with map.

kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.

I suspect that the problem is related to how I am trying to retrieve $cluster from the nested models as I've tried extracting that to create it's own column, but can't get that to work either.

1

There are 1 answers

1
Phil On BEST ANSWER

Putting it as an answer, because comments don't really allow for a ton of code.

The following works for me:

kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data),
         silhouettedata = map(augmentdata, ~ silhouette(as.numeric(levels(.x$.cluster))[.x$.cluster], dist(data))))

unnest(kmeans_k123, silhouettedata)

# A tibble: 81 x 6
       k km_model tidydata         glancedata       augmentdata       silhouettedata[,"cluster"] [,"neighbor"] [,"sil_width"]
   <int> <list>   <list>           <list>           <list>                                 <dbl>         <dbl>          <dbl>
 1     1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                         NA            NA         NA    
 2     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.743
 3     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.776
 4     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.772
 5     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.771
 6     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.742
 7     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.794
 8     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.723
 9     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.713
10     2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>                          2             1          0.683

About the usage of as.numeric(levels(.x$.cluster))[.x$.cluster], this is because broom::tidy() turns the cluster variable into a factor, and cluster::silhouette() requires the cluster variable to be numeric. This answer provides why you use that particular line of code to convert a factor of numbers into a numeric variable.