I am trying to create 10 clusters from a segment variable while also requiring that the total of a bound variable within each cluster be at least 10,000. Here is what I have done so far:
# Prepare data
set.seed(20240124)
data <- tigris::counties(
  state = "IA", cb = TRUE, year = 2020, progress_bar = FALSE
) |>
  dplyr::select(GEOID) |>
  dplyr::mutate(
    segment_var = rexp(99),
    bound_var = rexp(99, rate = 0.0001)
  )
dplyr::glimpse(data)
#> Rows: 99
#> Columns: 4
#> $ GEOID <chr> "19075", "19149", "19117", "19025", "19111", "19163", "191…
#> $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-93.02679 4..., MULTIPOLYGON …
#> $ segment_var <dbl> 1.40702949, 3.06134427, 3.01756730, 0.91757114, 0.82408377…
#> $ bound_var <dbl> 7625.26661, 2171.95732, 2688.26497, 369.21961, 4603.78202,…
# Find clusters
clusters <- data |>
  sf::st_drop_geometry() |>
  dplyr::select(segment_var) |>
  dplyr::mutate(segment_var = scale(segment_var) |> as.vector()) |>
  dist() |>
  hclust() |>
  cutree(k = 10)
# Augment data with cluster info
data_clustered <- data |>
  dplyr::mutate(clust_id = as.factor(clusters))
dplyr::glimpse(data_clustered)
#> Rows: 99
#> Columns: 5
#> $ GEOID <chr> "19075", "19149", "19117", "19025", "19111", "19163", "191…
#> $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-93.02679 4..., MULTIPOLYGON …
#> $ segment_var <dbl> 1.40702949, 3.06134427, 3.01756730, 0.91757114, 0.82408377…
#> $ bound_var <dbl> 7625.26661, 2171.95732, 2688.26497, 369.21961, 4603.78202,…
#> $ clust_id <fct> 1, 2, 2, 3, 3, 4, 3, 5, 5, 6, 7, 8, 7, 8, 3, 8, 6, 7, 3, 8…
# Get the summary of bound variable by cluster
data_clustered |>
  sf::st_drop_geometry() |>
  dplyr::summarize(
    sum_bound_var = sum(bound_var),
    .by = clust_id
  ) |>
  dplyr::pull(sum_bound_var) |>
  range()
#> [1] 4860.222 234915.239
Created on 2024-01-24 with reprex v2.0.2
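The range above only shows the extremes. To see which clusters fall short, the per-cluster totals can be flagged against the threshold. This is a minimal base-R sketch on a synthetic stand-in for the county data (`toy`, `min_total`, and `meets_min` are illustrative names, and the values need not match the output above):

```r
# Synthetic stand-in for the county data (no geometry or download needed)
set.seed(20240124)
toy <- data.frame(
  segment_var = rexp(99),
  bound_var   = rexp(99, rate = 0.0001)
)

# Same clustering steps as in the reprex: scale, distance, hclust, cut at k = 10
toy$clust_id <- cutree(hclust(dist(scale(toy$segment_var))), k = 10)

# Per-cluster totals of the bound variable, flagged against the threshold
min_total <- 10000  # minimum total per cluster, taken from the question
totals <- aggregate(bound_var ~ clust_id, data = toy, FUN = sum)
totals$meets_min <- totals$bound_var >= min_total

# Smallest totals first, so violating clusters appear at the top
totals[order(totals$bound_var), ]
```

Any row with `meets_min == FALSE` is a cluster that violates the constraint.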
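As the range shows, at least one cluster totals only about 4,860, which is below the 10,000 minimum. How can I enforce this additional constraint on the bound variable when producing the clusters?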
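For reference, one simple post-processing heuristic can be sketched as follows: repeatedly merge the cluster with the smallest bound total into the cluster whose mean segment value is closest, until every cluster meets the minimum. `merge_small_clusters()` is a hypothetical helper written for this illustration, not a function from any package, and the data are synthetic. Note the limitation: each merge removes a cluster, so the result has fewer than 10 clusters whenever the constraint is initially violated. It therefore does not solve the constrained 10-cluster problem itself.

```r
# Hypothetical helper (not from any package): merge clusters whose bound total
# is below `min_total` into the cluster with the nearest mean of `x`.
merge_small_clusters <- function(x, bound, clust, min_total) {
  clust <- as.integer(clust)
  repeat {
    totals <- vapply(split(bound, clust), sum, numeric(1))
    # Stop once every cluster meets the minimum, or only one cluster is left
    if (length(totals) < 2 || all(totals >= min_total)) break

    # Cluster with the smallest total, and the mean of `x` in each cluster
    small_id <- names(totals)[which.min(totals)]
    centers <- vapply(split(x, clust), mean, numeric(1))

    # Nearest other cluster by distance between cluster means
    others <- centers[names(centers) != small_id]
    nearest_id <- names(others)[which.min(abs(others - centers[[small_id]]))]

    clust[clust == as.integer(small_id)] <- as.integer(nearest_id)
  }
  clust
}

# Synthetic data mirroring the reprex (no county geometry needed)
set.seed(20240124)
x <- rexp(99)
bound <- rexp(99, rate = 0.0001)
clust <- cutree(hclust(dist(scale(x))), k = 10)

merged <- merge_small_clusters(x, bound, clust, min_total = 10000)
range(vapply(split(bound, merged), sum, numeric(1)))  # per-cluster totals
length(unique(merged))                                # clusters remaining
```

Merging by nearest cluster mean keeps the merged groups close in `segment_var`, but the choice of merge partner is an assumption of this sketch. Other rules, such as merging into the nearest cluster that is itself below the threshold, would be equally defensible.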