R: use a repeat loop to create clusters of a limited size


I have some code that clusters points together based on the distance between them. At the moment, if any cluster contains more than four points, the loop repeats with the clustering distance halved. With the current code, the loop repeats the calculation for all clusters until no cluster has more than four points.

The problem with my current code (see below) is that it loops over everything again, but I only want it to repeat the calculation for clusters with more than four points. Consider the following example: a distance of 40,000 m gives me 'cluster 1' with 5 points and 'cluster 2' with 2 points. At the moment, my code repeats the calculation for both of these clusters. What I want is for the code to repeat the calculation only for cluster 1. Iteration should continue until there is no cluster with more than four points.

This is my current code:

library(sf)
library(dplyr)
#I set the distance to 80,000 metres to begin with
d <- 80000

repeat{
  points <- points %>%
    st_as_sf(coords = c('LATITUDE', 'LONGITUDE')) %>%
    st_set_crs(4326)
  
  #Here I am calculating a distance matrix for all points
  dmatrix = st_distance(points)
  dmatrix = unclass(dmatrix)
  
  #Here is where I am halving the distance
  d = 0.5 * d
  #Here I am creating the clusters
  clustering_analysis = hclust(as.dist(dmatrix>d), method = "single")
  cluster = cutree(clustering_analysis, h=0.5)
  
  grouping_graph = st_sf(geom = do.call(c, lapply(1:max(cluster), 
  function(g) {st_union(points[cluster==g,])})))
  
  grouping_graph$cluster = 1:nrow(grouping_graph)
  
  Mylist <- list()
  
  for(i in 1:dim(grouping_graph)[1])
  {
    Mylist[[i]] <- 
    do.call(rbind,lapply(grouping_graph$geom[[i]],data.frame))
    Mylist[[i]]$cluster <- grouping_graph$cluster[[i]]  
  }
  #Data is the desired output
  Data <- do.call(rbind,Mylist)
  print(Data)
  #DataTally counts the number of points in each cluster
  DataTally <- Data %>% group_by(cluster)%>%tally()
  #Here I am determining whether there are any clusters of more than 4 points
  DFTallyTrue = filter(DataTally, n>4) 
  
  if(nrow(DFTallyTrue) == 0){
    break
  }
}
print(Data)

Data is the desired output, and when you view Data you can see that no cluster has more than 4 points. Starting with a distance of 80000 means the loop repeats 5 times. If you print Data on each iteration, you can see that some clusters have fewer than 4 points even in the first iteration, but the current code still loops back over all clusters.

Reproducible data:

points <- structure(list(LATITUDE = c(32.70132, 34.74251, 32.55205, 32.64144, 
34.92803, 32.38016, 32.42127, 32.9095, 33.58092, 32.51617, 33.5726, 
33.83251, 34.65639, 34.27694, 33.73851, 33.95132, 31.35445, 34.05263, 
33.37959, 30.50248, 32.31561, 32.66919, 31.75039, 33.56986, 33.27091, 
33.93598, 32.30964, 31.09773, 32.26711, 33.54263, 34.72014, 34.78548, 
30.65705, 31.25939, 31.27647, 30.54322, 31.22416, 33.38549, 33.18338, 
31.16811, 32.38368, 32.36253, 31.14464), LONGITUDE = c(-85.52518, 
-86.88351, -87.34777, -85.3543, -87.81506, -86.2979, -87.0869, 
-85.75888, -86.27647, -86.21179, -86.65275, -87.2696, -85.72738, 
-87.71489, -86.48934, -86.29693, -88.22943, -87.55328, -85.31454, 
-87.79342, -86.88108, -86.26669, -88.04425, -86.44631, -87.74383, 
-87.72403, -86.28067, -85.4449, -87.62541, -86.56251, -86.48971, 
-85.59656, -88.24491, -86.60828, -86.18112, -88.22778, -85.63784, 
-86.03297, -87.55456, -85.37719, -86.38047, -86.21579, -86.86606
) ), .Names = c("LATITUDE", "LONGITUDE"), class = "data.frame", row.names = c(NA, 
-43L))
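
To make the behaviour I am after more concrete, here is a rough sketch of the logic in base R. It is only an illustration under assumptions, not a tested solution for my data: `split_clusters` is a hypothetical helper name, it takes a plain numeric distance matrix (for example `unclass(st_distance(points))`) rather than an sf object, and `min_d` is an assumed lower limit on the distance so that the recursion stops even if some points are closer together than any reasonable threshold. The idea is to cluster once, then re-cluster with half the distance only inside clusters that have more than `max_size` points, leaving the small clusters untouched.

```r
# Hypothetical helper (not part of sf or dplyr).
# dm       : square numeric distance matrix
# d        : distance used for the first clustering pass
# max_size : largest allowed cluster size
# min_d    : assumed floor; clusters are not split further once d/2 < min_d
# Returns a list of integer vectors holding the row indices of each cluster.
split_clusters <- function(dm, d, max_size = 4, min_d = 1,
                           idx = seq_len(nrow(dm))) {
  # hclust() needs at least two points
  if (length(idx) < 2) return(list(idx))

  # Same trick as in my current code: a 0/1 "further apart than d" matrix
  # with single linkage, cut at 0.5, groups points chained within d
  sub <- dm[idx, idx, drop = FALSE]
  groups <- cutree(hclust(as.dist(1 * (sub > d)), method = "single"), h = 0.5)

  out <- list()
  for (g in split(idx, groups)) {
    if (length(g) > max_size && d / 2 >= min_d) {
      # Only oversized clusters are re-clustered, with half the distance
      out <- c(out, split_clusters(dm, d / 2, max_size, min_d, idx = g))
    } else {
      # Small clusters are kept exactly as they are
      out <- c(out, list(g))
    }
  }
  out
}

# Toy usage with made-up 1-D positions (not my real data)
x <- c(0, 1, 2, 10, 11, 100, 101)
res <- split_clusters(as.matrix(dist(x)), d = 16)
lengths(res)
```

With the real data, the call would presumably look like `split_clusters(unclass(st_distance(points)), d = 40000)`, since my loop halves 80000 before the first clustering.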