Sample from a dataset to generate a subset that has similar properties to another dataset


Let's say I have a large dataset of numeric values:

big_dataset = rnorm(n = 500, mean = 20, sd = 10)

I want to pull out a subset of observations from big_dataset that have similar values (within 5 units) to those in an existing and independent dataset:

independent_dataset = runif(n = 20, min = 0, max = 100)

That way my new subset and independent_dataset are comparable. What's a good way to do this?

All I can come up with is an apply loop that searches iteratively for something close to each value in independent_dataset but I thought there might be a more elegant way...

There are 2 answers

ThomasIsCoding

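You can try the following, which returns the value in big_dataset closest to each value in independent_dataset (it does not itself enforce the 5-unit tolerance):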

big_dataset[max.col(-abs(outer(independent_dataset, big_dataset, `-`)))]
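
To make the steps in this one-liner explicit, and to add the 5-unit cutoff that it does not apply on its own, a sketch along these lines may help. The seed and the names dist_mat, nearest, nearest_dist and matched are assumptions for illustration, not part of the original answer.

```r
set.seed(42)  # assumed seed, only so the sketch is repeatable
big_dataset <- rnorm(n = 500, mean = 20, sd = 10)
independent_dataset <- runif(n = 20, min = 0, max = 100)

# 20 x 500 matrix of absolute differences: rows are query values,
# columns are candidates from big_dataset.
dist_mat <- abs(outer(independent_dataset, big_dataset, `-`))

# Column index of the smallest distance in each row (the max of the negated
# matrix). Note that max.col treats near-equal entries as ties.
nearest <- max.col(-dist_mat, ties.method = "first")

# Distance to that nearest candidate, looked up by (row, column) pairs.
nearest_dist <- dist_mat[cbind(seq_along(independent_dataset), nearest)]

# Keep the nearest value only when it lies within 5 units; otherwise NA.
matched <- ifelse(nearest_dist <= 5, big_dataset[nearest], NA_real_)
```

Because big_dataset is drawn from a normal distribution centred on 20 while independent_dataset spans 0 to 100, many of the larger query values will probably have no candidate within 5 units and come back as NA.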
jblood94

Using the radius search type of RANN::nn2 with a radius of 5:

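big_similar <- apply(
  RANN::nn2(
    big_dataset,
    independent_dataset,
    length(big_dataset),
    searchtype = "radius",
    radius = 5
  )[[1]],  # index matrix: positions in big_dataset, padded with 0 for "no neighbour"
  1,
  \(x) big_dataset[x[x != 0L]],  # drop the 0 padding, then look up the values
  simplify = FALSE  # always return a list, even if every row has the same length
)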

Each element of the big_similar list will contain the values within 5 of the corresponding value in independent_dataset.
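
To turn a list like this into a single subset of the same size as independent_dataset, one option is to draw one candidate per query value. The sketch below is only one reading of what "comparable" means here. It builds the candidate list in base R so that it runs without RANN, and the seed and the names candidates and matched_subset are illustrative assumptions.

```r
set.seed(1)  # assumed seed, only so the sketch is repeatable
big_dataset <- rnorm(n = 500, mean = 20, sd = 10)
independent_dataset <- runif(n = 20, min = 0, max = 100)

# Candidate values within 5 units of each query value (a base R stand-in for
# the radius search: one list element per query value).
candidates <- lapply(
  independent_dataset,
  function(x) big_dataset[abs(big_dataset - x) <= 5]
)

# Draw one candidate per query value; NA marks values with no match in range.
# sample.int() on the length avoids sample()'s special case for length-1 input.
matched_subset <- vapply(
  candidates,
  function(v) if (length(v) > 0) v[sample.int(length(v), 1)] else NA_real_,
  numeric(1)
)

# Pair each query value with its draw for inspection.
data.frame(independent = independent_dataset, matched = matched_subset)
```

Draws are made independently for each query value, so the same element of big_dataset can be picked more than once when two query values lie close together.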


Another option, using data.table:

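library(data.table)
i <- seq_along(independent_dataset)
# Stack the lower bounds, the big_dataset values, and the upper bounds in one
# column; i is 0 for big_dataset rows and the query index for bound rows.
dt <- data.table(
  v = c(independent_dataset - 5, big_dataset, independent_dataset + 5),
  i = c(i, integer(length(big_dataset)), i)
)
# Sort by value and record each row's position in the sorted table.
setorder(dt, v)[,j := .I]
# For each query index, collect the big_dataset values between its two bounds.
setorder(dt[i != 0L
  , {
    r <- j
    .(vInd = independent_dataset[i], vBig = .(dt[r[1]:r[2]][i == 0L, v]))
  }, i
], i)[,i := NULL][]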
#>         vInd                                                      vBig
#>  1: 39.05293 34.40019,34.42997,34.48287,34.94542,35.02730,35.05976,...
#>  2: 52.24089                                48.30979,50.67553,56.02477
#>  3: 85.36959                                                          
#>  4: 59.40061                                                  56.02477
#>  5: 10.30652 5.344624,5.365774,5.466547,5.533230,5.787786,5.817441,...
#>  6: 96.68002                                                          
#>  7: 39.81330 34.94542,35.02730,35.05976,35.67742,35.72234,35.75184,...
#>  8: 70.32747                                                          
#>  9: 79.96525                                                          
#> 10: 42.92298 37.98900,38.30225,38.40954,38.82281,38.88412,39.10196,...
#> 11: 42.01954 37.24025,37.34674,37.50477,37.62358,37.69603,37.75336,...
#> 12: 40.99353 36.70751,36.91647,36.92494,37.24025,37.34674,37.50477,...
#> 13: 36.90136 32.14804,32.19208,32.22644,32.38380,32.43632,32.57504,...
#> 14: 58.42638                                                  56.02477
#> 15: 75.50957                                                          
#> 16: 39.57639 34.94542,35.02730,35.05976,35.67742,35.72234,35.75184,...
#> 17: 81.14091                                                          
#> 18: 93.58480                                                          
#> 19: 42.86739 37.98900,38.30225,38.40954,38.82281,38.88412,39.10196,...
#> 20: 63.53754
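
The sort-then-slice idea behind the data.table approach can also be sketched in base R with findInterval. This is a swapped-in technique rather than the answer's own code, and the seed and the names sorted_big, lo, hi and within_5 are assumptions for illustration.

```r
set.seed(7)  # assumed seed, only so the sketch is repeatable
big_dataset <- rnorm(n = 500, mean = 20, sd = 10)
independent_dataset <- runif(n = 20, min = 0, max = 100)

sorted_big <- sort(big_dataset)

# lo: number of sorted values strictly below each lower bound (x - 5).
lo <- findInterval(independent_dataset - 5, sorted_big, left.open = TRUE)
# hi: number of sorted values at or below each upper bound (x + 5).
hi <- findInterval(independent_dataset + 5, sorted_big)

# Positions lo + 1 .. hi of sorted_big fall inside [x - 5, x + 5];
# seq_len() yields an empty slice when hi == lo.
within_5 <- Map(function(l, h) sorted_big[l + seq_len(h - l)], lo, hi)
```

After the one-off sort, each query value costs two binary searches rather than a scan of big_dataset, which matters little at 500 values but may help with much larger inputs.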