Consider a character vector, pool, whose elements are (zero-padded) binary numbers with up to max_len digits.
max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
pool
## [1] "0" "1" "00" "10" "01" "11" "000" "100" "010" "110"
## [11] "001" "101" "011" "111" "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"
I'd like to sample n of these elements, with the constraint that none of the sampled elements are prefixes of any of the other sampled elements (i.e. if we sample 1101, we ban 1, 11, and 110, while if we sample 1, we ban those elements beginning with 1, such as 10, 11, 100, etc.).
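For concreteness, a candidate sample can be tested against this constraint with a small helper along these lines (the helper name is purely illustrative):
## TRUE if no element of x is a proper prefix of another element of x
is_prefix_free <- function(x) {
  !any(outer(x, x, function(a, b) a != b & startsWith(b, a)))
}
is_prefix_free(c("01", "000", "11"))  # TRUE
is_prefix_free(c("1", "10", "001"))   # FALSE: "1" is a prefix of "10"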
The following is my attempt using while, but of course this is slow when n is large (or when it approaches 2^max_len).
set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
  # which chosen elements are prefixes of other chosen elements?
  prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
  # drop pool elements that begin with any kept element (including the kept elements themselves)
  pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
    paste0('^', chosen[!prefixes]), pool)) == 0]
  # keep the non-prefix picks and redraw replacements from the reduced pool
  chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}
chosen
## [1] "0100" "0101" "0001" "0011" "1000" "111" "0000" "0110" "1100" "0111"
This can be improved slightly by initially removing from pool those elements whose inclusion would mean there are insufficient elements left in pool to take a total sample of size n. E.g., when max_len = 4 and n > 9, we can immediately remove 0 and 1 from pool, since by including either, the maximum sample would be 9 (either 0 and the eight 4-character elements beginning with 1, or 1 and the eight 4-character elements beginning with 0).
Based on this logic, we could then omit elements from pool, prior to taking the initial sample, with something like:
pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]
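To see what that expression is doing, the inner vector gives the largest achievable sample size if an element of each length is included (evaluated here for max_len = 4 and n = 10 as above; illustrative only):
# maximum total sample size if a 1-, 2-, 3- or 4-character element is included
2^max_len - rev(2^(0:max_len))[-1] + 1
## [1]  9 13 15 16
# n = 10 exceeds only the first of these, so elements of length 1
# ("0" and "1") are dropped before the initial sample is taken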
Can anyone think of a better approach? I feel like I'm overlooking something much more simple.
EDIT
To clarify my intentions, I'll portray the pool as a set of branches, where the junctions and tips are the nodes (elements of pool). Assume the yellow node in the following figure (i.e. 010) was drawn. Now, the entire red "branch", which consists of nodes 0, 01, and 010, is removed from the pool. This is what I meant by forbidding sampling of nodes that "prefix" nodes already in our sample (as well as those already prefixed by nodes in our sample).
If the sampled node was half-way along a branch, such as 01 in the following figure, then all red nodes (0, 01, 010, and 011) are disallowed, since 0 prefixes 01, and 01 prefixes both 010 and 011.
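For illustration, the set of nodes that a single pick rules out can be listed with a small helper like this (banned_by is just an illustrative name, run here against the full pool defined at the top):
# nodes banned by picking 'node': its ancestors (proper prefixes)
# and its descendants (nodes that it prefixes)
banned_by <- function(node, pool) {
  pool[(startsWith(node, pool) | startsWith(pool, node)) & pool != node]
}
banned_by("010", pool)  # with the full pool: "0", "01", plus "0100" and "0101"
banned_by("01", pool)   # "0" plus every longer element beginning with "01"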
I don't mean to sample either 1 or 0 at each junction (i.e. walking along the branches flipping coins at forks) - it's fine to have both in the sample, as long as: (1) parents (or grand-parents, etc.) or children (grandchildren, etc.) of the node aren't already sampled; and (2) upon sampling the node there will be sufficient nodes remaining to achieve the desired sample of size n.
In the second figure above, if 01 was the first pick, all of the black nodes are still (currently) valid, assuming n <= 4. For example, if n==4 and we sampled node 1 next (and so our picks now included 01 and 1), we would subsequently disallow node 00 (due to rule 2 above) but could still pick 000 and 001, giving us our 4-element sample. If n==5, on the other hand, node 1 would be disallowed at this stage.
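One way to check rule 2 is to count the max_len-character nodes that no current pick prefixes: those can always all be added, and nothing beyond them can be. A small illustrative check of the example above (pool3 and max_reachable are just names I'm using here):
# a max_len = 3 pool matching the figures
pool3 <- unlist(lapply(1:3, function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
# largest sample attainable given the current picks
max_reachable <- function(picks, pool, max_len = 3) {
  leaves  <- pool[nchar(pool) == max_len]
  blocked <- rowSums(sapply(picks, function(p) startsWith(leaves, p))) > 0
  length(picks) + sum(!blocked)
}
max_reachable("01", pool3)          # 7
max_reachable(c("01", "1"), pool3)  # 4, so picking 1 after 01 works for n == 4 but not n == 5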
Intro
This is a numeric variation of the string algorithm we implemented in the other answer. It is faster and does not require either creating or sorting the pool.
Algorithm Outline
We can use integer numbers to represent your binary strings, which greatly simplifies the problem of pool generation and sequential elimination of values. For example, with max_len==3, we can take the number 1-- (where - represents padding) to mean 4 in decimal. Further, we can determine that the numbers that require elimination if we pick this number are those between 4 and 4 + 2 ^ x - 1. Here x is the number of padding elements (2 in this case), so the numbers to eliminate are between 4 and 4 + 2 ^ 2 - 1 (or between 4 and 7, expressed in binary as 100, 101, 110, and 111).
In order to match your problem exactly we need a little wrinkle, since you treat numbers that would potentially be the same in binary as being distinct for some portions of your algorithm. For example, 100, 10-, and 1-- are all the same number, but need to be treated differently in your scheme. In a max_len==3 world, we have 8 possible numbers, but 14 possible representations: 0 and 4 have three possible encodings each, 2 and 6 have two, and all others just one. We need to generate an integer pool that represents the higher probability of selection for the numbers with multiple representations, as well as a mechanism for tracking how many blanks each number includes. We can do this by appending a few bits at the end of the number to indicate the weightings we want. So our numbers become (we use two bits here):
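As an illustration of what such an encoded pool could look like for the max_len==3 example (this is a sketch of the idea, not the gist's pool-generation code; it assumes the low bits store the number of explicitly specified digits minus one):
ml       <- 3   # the max_len == 3 example above
enc.bits <- 2
# one row per representation: the underlying integer and how many digits
# are explicitly specified (1 specified digit means 2 padding blanks, etc.)
reps <- do.call(rbind, lapply(seq_len(ml), function(d)
  data.frame(int = seq(0, 2^ml - 1, by = 2^(ml - d)), specified = d)))
reps$int.enc <- reps$int * 2^enc.bits + (reps$specified - 1)
nrow(reps)     # 14 representations of the 8 possible numbers
reps$int.enc
## [1]  0 16  1  9 17 25  2  6 10 14 18 22 26 30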
A few useful properties:
- enc.bits represents how many bits we need for the encoding (two in this case)
- int.enc %% 2 ^ enc.bits tells us how many of the digits are explicitly specified
- int.enc %/% 2 ^ enc.bits returns int
- int * 2 ^ enc.bits + explicitly.specified returns int.enc
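For example, with enc.bits set to 2 as above, the arithmetic works out as follows (illustrative only):
enc.bits <- 2
x <- 4 * 2^enc.bits + (2 - 1)  # int 4 with 2 digits specified ("10-"), stored as 2 - 1
x %/% 2^enc.bits               # 4, the underlying integer
x %% 2^enc.bits + 1            # 2, the number of explicitly specified digits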
Note that explicitly.specified here is between 0 and max_len - 1 in our implementation since there is always at least one digit specified. We now have an encoding that fully represents your data structure using integers only. We can sample from integers and reproduce your desired outcomes, with the correct weights, etc. One limitation of this approach is that we use 32 bit integers in R, and we must reserve some bits for the encoding, so we limit ourselves to pools of max_len==25 or so. You could go bigger if you used integers specified by double precision floating points, but we didn't do that here.
Avoiding Duplicate Picks
There are two rough ways to ensure we don't pick the same value twice:
1. before each pick, remove from the pool of candidate values everything that has already been picked or disqualified, and sample from what remains, or
2. shuffle the full universe of possible values once up front, then take values in that order, skipping any that earlier picks have disqualified.
While the first option seems like the cleanest, it is actually very expensive computationally. It requires either doing a vector scan of all possible values for each pick to pre-disqualify picked values, or creating a shrinking vector containing non-disqualified values. The shrinking option is only more efficient than the vector scan if the vector is made to shrink by reference via C code, but even then it requires repeated translations of potentially large portions of the vector, and it requires C.
Here we use method #2. This allows us to randomly shuffle the universe of possible values once, and then pick each value in sequence, check that it has not been disqualified, and if it has, pick another one, etc. This works because it is trivial to check whether a value has been picked as a result of our value encoding; we can infer the location of a value in a sorted table based on the value alone. So we record the status of each value in a sorted table, and can either update or lookup that state via direct index access (no scan required).
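Here is a rough sketch of that shuffle-and-skip logic. To keep it short it works on the original string pool and uses a named status vector rather than the arithmetic index lookup the integer encoding gives us, and it makes no attempt to guarantee a complete draw, so treat it as an illustration of the idea rather than the actual implementation (sample_prefix_free is just an illustrative name):
sample_prefix_free <- function(pool, n) {
  ord   <- sample(pool)                              # shuffle the universe once
  alive <- setNames(rep(TRUE, length(pool)), pool)   # status table, looked up by name
  picks <- character(0)
  for (val in ord) {
    if (!alive[[val]]) next                          # disqualified by an earlier pick
    picks <- c(picks, val)
    if (length(picks) == n) return(picks)
    # disqualify the pick, its prefixes, and everything it prefixes
    alive[startsWith(pool, val) | startsWith(val, pool)] <- FALSE
  }
  stop("ran out of values before reaching n picks")
}
set.seed(1)
try(sample_prefix_free(pool, 10))  # may fail for unlucky shuffles; the real implementation handles completeness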
Examples
The implementation of this algorithm in base R is available as a gist. This particular implementation only pulls complete draws. Here is a sample of 10 draws of 8 elements from a max_len==4 pool:
We also originally had two implementations that relied on method #1 for avoiding duplicates, one in base R, and one in C, but even the C version isn't as fast as the new base R version when n is large. Those functions do implement the ability to draw incomplete draws, so we provide them here for reference:
Comparative benchmarks
Here is a set of benchmarks comparing several of the functions that show up in this Q/A. Times in milliseconds. The brodie.b version is the one described in this answer. brodie is the original implementation, brodie.C is the original with some C. All of these enforce the requirement for complete samples. brodie.str is the string-based version in the other answer. This scales relatively well to larger pools.
Benchmark notes:
- "-" entries represent either infeasible options or options too slow to time reasonably
- The timings do not include drawing the pools (0.8, 2.5, and 401 milliseconds respectively for sizes 4, 8, and 16), which is necessary for the jbaum, josilber, and brodie.str runs, or sorting them (0.1, 2.7, and 3700 milliseconds for sizes 4, 8, and 16), which is necessary for brodie.str in addition to the draw. Whether you want to include these or not depends on how many times you run the function for a particular pool. Also, there almost certainly are better ways of generating / sorting the pools.
- These are the middle times of three runs with microbenchmark. The code is available as a gist, though note you'll have to load the sample0110b, sample0110, sample01101, and sample01 functions beforehand.