Random sample of character vector, without elements prefixing one another

4k views Asked by At

Consider a character vector, pool, whose elements are (zero-padded) binary numbers with up to max_len digits.

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))

pool
##  [1] "0"    "1"    "00"   "10"   "01"   "11"   "000"  "100"  "010"  "110" 
## [11] "001"  "101"  "011"  "111"  "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"

I'd like to sample n of these elements, with the constraint that none of the sampled elements are prefixes of any of the other sampled elements (i.e. if we sample 1101, we ban 1, 11, and 110, while if we sample 1, we ban those elements beginning with 1, such as 10, 11, 100, etc.).

The following is my attempt using while, but of course this is slow when n is large (or when it approaches 2^max_len).

set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
  prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
  pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
    paste0('^', chosen[!prefixes]), pool)) == 0]
  chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}

chosen
## [1] "0100" "0101" "0001" "0011" "1000" "111"  "0000" "0110" "1100" "0111"

This can be improved slightly by initially removing from pool those elements whose inclusion would mean there are insufficient elements left in pool to take a total sample of size n. E.g., when max_len = 4 and n > 9, we can immediately remove 0 and 1 from pool, since by including either, the maximum sample would be 9 (either 0 and the eight 4-character elements beginning with 1, or 1 and the eight 4-character elements beginning with 0).

Based on this logic, we could then omit elements from pool, prior to taking the initial sample, with something like:

pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]

Can anyone think of a better approach? I feel like I'm overlooking something much more simple.


EDIT

To clarify my intentions, I'll portray the pool as a set of branches, where the junctions and tips are the nodes (elements of pool). Assume the yellow node in the following figure (i.e. 010) was drawn. Now, the entire red "branch", which consists of nodes 0, 01, and 010, is removed from the pool. This is what I meant by forbidding sampling of nodes that "prefix" nodes already in our sample (as well as those already prefixed by nodes in our sample).

enter image description here

If the sampled node was half-way along a branch, such as 01 in the following figure, then all red nodes (0, 01, 010, and 011) are disallowed, since 0 prefixes 01, and 01 prefixes both 010 and 011.

enter image description here

I don't mean to sample either 1 or 0 at each junction (i.e. walking along the branches flipping coins at forks) - it's fine to have both in the sample, as long as: (1) parents (or grand-parents, etc.) or children (grandchildren, etc.) of the node aren't already sampled; and (2) upon sampling the node there will be sufficient nodes remaining to achieve the desired sample of size n.

In the second figure above, if 010 was the first pick, all nodes at black nodes are still (currently) valid, assuming n <= 4. For example, if n==4 and we sampled node 1 next (and so our picks now included 01 and 1), we would subsequently disallow node 00 (due to rule 2 above) but could still pick 000 and 001, giving us our 4-element sample. If n==5, on the other hand, node 1 would be disallowed at this stage.

8

There are 8 answers

11
BrodieG On BEST ANSWER

Intro

This is a numeric variation of the string algorithm we implemented in the other answer. It is faster and does not require either creating or sorting the pool.

Algorithm Outline

We can use integer numbers to represent your binary strings, which greatly simplifies the problem of pool generation and sequential elimination of values. For example, with max_len==3, we can take the number 1-- (where - represents padding) to mean 4 in decimal. Further, we can determine that the numbers that require elimination if we pick this number are those between 4 and 4 + 2 ^ x - 1. Here x is the number of padding elements (2 in this case), so the numbers to eliminate are between 4 and 4 + 2 ^ 2 - 1 (or between 4 and 7, expressed as 100, 110, and 111).

In order to match your problem exactly we need a little wrinkle since you treat numbers that would be potentially the same in binary as being the distinct for some portions of your algorithm. For example, 100, 10-, and 1-- are all the same number, but need to be treated differently in your scheme. In a max_len==3 world, we have 8 possible numbers, but 14 possible representations:

0 - 000: 0--, 00-
1 - 001:
2 - 010: 01-
3 - 011:
4 - 100: 1--, 10-
5 - 101:
6 - 110: 11-
7 - 111:

So 0 and 4 have three possible encodings, 2 and 6 have two, and all others just one. We need generate an integer pool that represents the higher probability of selection for the numbers with multiple representation, as well as mechanism for tracking how many blanks the number includes. We can do this by appending a few bits at the end of the number to indicate the weightings we want. So our numbers become (we use two bits here):

jbaum | int | bin | bin.enc | int.enc    
  0-- |   0 | 000 |   00000 |       0
  00- |   0 | 000 |   00001 |       1      
  000 |   0 | 000 |   00010 |       2      
  001 |   1 | 001 |   00100 |       3      
  01- |   2 | 010 |   01000 |       4  
  010 |   2 | 010 |   01001 |       5  
  011 |   3 | 011 |   01101 |       6  
  1-- |   4 | 100 |   10000 |       7  
  10- |   4 | 100 |   10001 |       8  
  100 |   4 | 100 |   10010 |       9  
  101 |   5 | 101 |   10100 |      10  
  11- |   6 | 110 |   11000 |      11   
  110 |   6 | 110 |   11001 |      12   
  111 |   7 | 111 |   11100 |      13

A few useful properties:

  • enc.bits represents how many bits we need for encoding (two in this case)
  • int.enc %% enc.bits tells us how many of the numbers are explicitly specified
  • int.enc %/% enc.bits returns int
  • int * 2 ^ enc.bits + explicitly.specified returns int.enc

Note that explicitly.specified here is between 0 and max_len - 1 in our implementation since there is always at least one digit specified. We now have an encoding that fully represents your data structure using integers only. We can sample from integers and reproduce your desired outcomes, with the correct weights, etc. One limitation of this approach is that we use 32 bit integers in R, and we must reserve some bits for the encoding, so we limit ourselves to pools of max_len==25 or so. You could go bigger if you used integers specified by double precision floating points, but we didn't do that here.

Avoiding Duplicate Picks

There are two rough ways to ensure we don't pick the same value twice

  1. Track what values remain available to pick, and sample randomly from those
  2. Sample randomly from all possible values, and then check whether the value has been picked previously, and if it has, sample again

While the first option seems like the cleanest, it actually is very expensive computationally. It requires either doing a vector scan of all possible values for each pick to pre-disqualify picked values, or create a shrinking vector containing non-disqualified values. The shrinking option is only more efficient than the vector scan if the vector is made to shrink by reference via C code, but even then it requires repeated translations of potentially large portions of the vector, and it requires C.

Here we use method #2. This allows us to randomly shuffle the universe of possible values once, and then pick each value in sequence, check that it has not been disqualified, and if it has, pick another one, etc. This works because it is trivial to check whether a value has been picked as a result of our value encoding; we can infer the location of a value in a sorted table based on the value alone. So we record the status of each value in a sorted table, and can either update or lookup that state via direct index access (no scan required).

Examples

The implementation of this algorithm in base R is available as a gist. This particular implementation only pulls complete draws. Here is a sample of 10 draws of 8 elements from a max_len==4 pool:

# each column represents a draw from a `max_len==4` pool

set.seed(6); replicate(10, sample0110b(4, 8))
     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]   [,10] 
[1,] "1000" "1"    "0011" "0010" "100"  "0011" "0"    "011"  "0100" "1011"
[2,] "111"  "0000" "1101" "0000" "0110" "0100" "1000" "00"   "0101" "1001"
[3,] "0011" "0110" "1001" "0100" "0000" "0101" "1101" "1111" "10"   "1100"
[4,] "0100" "0010" "0000" "0101" "1101" "101"  "1011" "1101" "0110" "1101"
[5,] "101"  "0100" "1100" "1100" "0101" "1001" "1001" "1000" "1111" "1111"
[6,] "110"  "0111" "1011" "111"  "1011" "110"  "1111" "0100" "0011" "000" 
[7,] "0101" "0101" "111"  "011"  "1010" "1000" "1100" "101"  "0001" "0101"
[8,] "011"  "0001" "01"   "1010" "0011" "1110" "1110" "1001" "110"  "1000"

We also originally had two implementations that relied on method #1 for avoiding duplicates, one in base R, and one in C, but even the C version isn't as fast as the new base R version when n is large. Those function do implement the ability to draw incomplete draws, so we provide them here for reference:

Comparative benchmarks

Here is a set of benchmarks comparing several of the functions that show up in this Q/A. Times in milliseconds. The brodie.b version is the one described in this answer. brodie is the original implementation, brodie.C is the original with some C. All of these enforce the requirement for complete samples. brodie.str is the string based version in the other answer.

   size    n  jbaum josilber  frank tensibai brodie.b brodie brodie.C brodie.str
1     4   10     11        1      3        1        1      1        1          0
2     4   50      -        -      -        1        -      -        -          1
3     4  100      -        -      -        1        -      -        -          0
4     4  256      -        -      -        1        -      -        -          1
5     4 1000      -        -      -        1        -      -        -          1
6     8   10      1      290      6        3        2      2        1          1
7     8   50    388        -      8        8        3      4        3          4
8     8  100  2,506        -     13       18        6      7        5          5
9     8  256      -        -     22       27       13     14       12          6
10    8 1000      -        -      -       27        -      -        -          7
11   16   10      -        -    615      688       31     61       19        424
12   16   50      -        -  2,123    2,497       28    276       19      1,764
13   16  100      -        -  4,202    4,807       30    451       23      3,166
14   16  256      -        - 11,822   11,942       40  1,077       43      8,717
15   16 1000      -        - 38,132   44,591       83  3,345      130     27,768

This scales relatively well to larger pools

system.time(sample0110b(18, 100000))
   user  system elapsed 
  8.441   0.079   8.527 

Benchmark notes:

  • frank, and brodie (minus brodie.str) don't require any pre-generation of pools, which will affect comparisons (see below)
  • Josilber is the LP version
  • jbaum is the OP example
  • tensibai is slightly modified to exit instead of failing if pool is empty
  • not set-up to run python, so can't quite compare / account for buffering
  • - represent either infeasible options or too slow to time reasonably

The timings do not include drawing the pools (0.8, 2.5, 401 milliseconds respectively for size 4, 8, and 16), which is necessary for the jbaum, josilber, and brodie.str runs, or sorting them (0.1, 2.7, 3700 milliseconds for size 4, 8, and 16), which is necessary for brodie.str in addition to the draw. Whether you want to include these or not depends on how many times you run the function for a particular pool. Also, there almost certainly are better ways of generating / sorting the pools.

These are the middle time of three runs with microbenchmark. The code is available as a gist, though note you'll have to load the sample0110b, sample0110, sample01101, and sample01 functions beforehand.

4
josliber On

If you don't want to generate the set of all possible tuples and then randomly sample (which as you note may be infeasible for large input sizes), another option is to draw a single sample with integer programming. Basically you could assign each element in pool a random value and then select the feasible tuple with the largest sum of values. This should give each tuple an equal probability of being selected because they are all the same size and their values were selected randomly. The constraints of the model would ensure that none of the disallowed pairs of tuples are selected and that the correct number of elements are selected.

Here's a solution with the lpSolve package:

library(lpSolve)
sample.lp <- function(pool, max_len) {
  pool <- sort(pool)
  pml <- max(nchar(pool))
  runs <- c(rev(cumsum(2^(seq(pml-1)))), 0)
  banned.from <- rep(seq(pool), runs[nchar(pool)])
  banned.to <- banned.from + unlist(lapply(runs[nchar(pool)], seq_len))
  banned.constr <- matrix(0, nrow=length(banned.from), ncol=length(pool))
  banned.constr[cbind(seq(banned.from), banned.from)] <- 1
  banned.constr[cbind(seq(banned.to), banned.to)] <- 1
  mod <- lp(direction="max",
            objective.in=runif(length(pool)),
            const.mat=rbind(banned.constr, rep(1, length(pool))),
            const.dir=c(rep("<=", length(banned.from)), "=="),
            const.rhs=c(rep(1, length(banned.from)), max_len),
            all.bin=TRUE)
  pool[which(mod$solution == 1)]
}
set.seed(144)
pool <- unlist(lapply(seq_len(4), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
sample.lp(pool, 4)
# [1] "0011" "010"  "1000" "1100"
sample.lp(pool, 8)
# [1] "0000" "0100" "0110" "1001" "1010" "1100" "1101" "1110"

This seems to scale to fairly large pools. For instance, it takes a little over 2 seconds to get a sample of length 20 from a pool of size 510:

pool <- unlist(lapply(seq_len(8), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
length(pool)
# [1] 510
system.time(sample.lp(pool, 20))
#    user  system elapsed 
#   0.232   0.008   0.239 

If you needed really, really huge problem sizes solved, then you could move from the non-open source solvers that ship with lpSolve to a commercial solver like gurobi or cplex (not free in general, but free for academic use).

5
BrodieG On

You can sort the pool to help decide what elements to disqualify. For example, looking at a three element sorted pool:

 [1] "0"   "00"  "000" "001" "01"  "010" "011" "1"   "10"  "100" "101" "11" 
[13] "110" "111"

I can tell that I can disqualify anything that follows my selected item that has more characters than my item up to first item that has the same number or fewer characters. For example, if I select "01", I can immediately see that the next two items ("010", "011") need to be removed, but not the one after that because "1" has fewer characters. Removing the "0" afterwards is easy. Here is an implementation:

library(fastmatch)  # could use `match`, but we repeatedly search against same hash

# `pool` must be sorted!

sample01 <- function(pool, n) {
  picked <- logical(length(pool))
  chrs <- nchar(pool)
  pick.list <- character(n)
  pool.seq <- seq_along(pool)

  for(i in seq(n)) {
    # Make sure pool not exhausted

    left <- which(!picked)
    left.len <- length(left)
    if(!length(left)) break

    # Sample from pool

    seq.left <- seq.int(left)
    pool.left <- pool[left]
    chrs.left <- chrs[left]
    pick <- sample(length(pool.left), 1L)

    # Find all the elements with more characters that are disqualified
    # and store their indices in `valid` (bad name...)

    valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick
    first.invalid <- which(!valid.tmp & seq.left > pick)
    valid <- if(length(first.invalid)) {
      pick:(first.invalid[[1L]] - 1L)
    } else pick:left.len

    # Translate back to original pool indices since we're working on a 
    # subset in `pool.left`

    pool.seq.left <- pool.seq[left]
    pool.idx <- pool.seq.left[valid]
    val <- pool[[pool.idx[[1L]]]]

    # Record the picked value, and all the disqualifications

    pick.list[[i]] <- val
    picked[pool.idx] <- TRUE

    # Disqualify shorter matches

    to.rem <- vapply(
      seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L
    )
    to.rem.idx <- fmatch(to.rem, pool, nomatch=0)
    picked[to.rem.idx] <- TRUE  
  }
  pick.list  
}

And a function to make sorted pools (exactly like your code, but returns sorted):

make_pool <- function(size)
  sort(
    unlist(
      lapply(
        seq_len(size), 
        function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))) 
  ) ) )

Then, using the max_len 3 pool (useful for visually inspecting things behave as expected):

pool3 <- make_pool(3)
set.seed(1)
sample01(pool3, 8)
# [1] "001" "1"   "010" "011" "000" ""    ""    ""   
sample01(pool3, 8)
# [1] "110" "111" "011" "10"  "00"  ""    ""    ""   
sample01(pool3, 8)
# [1] "000" "01"  "11"  "10"  "001" ""    ""    ""   
sample01(pool3, 8)
# [1] "011" "101" "111" "001" "110" "100" "000" "010"    

Notice in the last case we get all 3 digit binary combinations (2 ^ 3) because by chance we kept sampling from the 3 digit ones. Also, with just a 3 size pool there are many samplings that prevent a full 8 draws; you could address this with your suggestion of eliminating the combinations that prevent full draws from the pool.

This is pretty fast. Looking at the max_len==9 example that took 2 seconds with the alternate solution:

pool9 <- make_pool(9)
microbenchmark(sample01(pool9, 4))
# Unit: microseconds
#                expr     min      lq  median      uq     max neval
#  sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663   100    

About half a millisecond. You can reasonably attempt fairly large pools too:

pool16 <- make_pool(16)  # 131K entries
system.time(sample01(pool16, 100))
#  user  system elapsed 
# 3.407   0.146   3.552 

This is not very fast, but we're talking about a pool with 130K items. There is also potential for additional optimization.

Note the sorting step gets to be relatively slow for large pools, but I'm not counting it because you only ever need to do it once, and you could probably come up with a reasonable algorithm to produce the pool pre sorted.

There is also the possibility of a much faster integer to binary based approach that I explored in a now deleted answer, but that requires a fair bit more work to tie out to exactly what you're looking for.

10
Frank On

Mapping ids to strings. You can map numbers to your 0/1 vectors, as @BrodieG mentioned:

# some key objects

n_pool      = sum(2^(1:max_len))      # total number of indices
cuts        = cumsum(2^(1:max_len-1)) # new group starts
inds_by_g   = mapply(seq,cuts,cuts*2) # indices grouped by length

# the mapping to strings (one among many possibilities)

library(data.table)
get_01str <- function(id,max_len){
    cuts = cumsum(2^(1:max_len-1))
    g    = findInterval(id,cuts)
    gid  = id-cuts[g]+1

    data.table(g,gid)[,s:=
      do.call(paste,c(list(sep=""),lapply(
        seq(g[1]), 
        function(x) (gid-1) %/% 2^(x-1) %% 2
      )))
    ,by=g]$s      
} 

Finding ids to drop. We'll sequentially drop ids from the sampling pool:

 # the mapping from one index to indices of nixed strings

get_nixstrs <- function(g,gid,max_len){

    cuts         = cumsum(2^(1:max_len-1))
    gids_child   = {
      x = gid%%2^sequence(g-1)
      ifelse(x,x,2^sequence(g-1))
    }
    ids_child    = gids_child+cuts[sequence(g-1)]-1

    ids_parent   = if (g==max_len) gid+cuts[g]-1 else {

      gids_par       = vector(mode="list",max_len)
      gids_par[[g]]  = gid
      for (gg in seq(g,max_len-1)) 
        gids_par[[gg+1]] = c(gids_par[[gg]],gids_par[[gg]]+2^gg)

      unlist(mapply(`+`,gids_par,cuts-1))
    }

    c(ids_child,ids_parent)
}

The indices are grouped by g, the number of characters, nchar(get_01str(id)). Because the indices are sorted by g, g=findInterval(id,cuts) is a faster route.

An index in group g, 1 < g < max_len has one "child" index of size g-1 and two parent indices of size g+1. For each child node, we take its child node until we hit g==1; and for each parent node, we take their pair of parent nodes until we hit g==max_len.

The structure of the tree is simplest in terms of the identifier within the group, gid. gid maps to the two parents, gid and gid+2^g; and reversing this mapping one finds the child.

Sampling

drawem <- function(n,max_len){
    cuts        = cumsum(2^(1:max_len-1))
    inds_by_g   = mapply(seq,cuts,cuts*2)

    oklens = (1:max_len)[ n <= 2^max_len*(1-2^(-(1:max_len)))+1 ]
    okinds = unlist(inds_by_g[oklens])

    mysamp = rep(0,n)
    for (i in 1:n){

        id        = if (length(okinds)==1) okinds else sample(okinds,1)
        g         = findInterval(id,cuts)
        gid       = id-cuts[g]+1
        nixed     = get_nixstrs(g,gid,max_len)

        # print(id); print(okinds); print(nixed)

        mysamp[i] = id
        okinds    = setdiff(okinds,nixed)
        if (!length(okinds)) break
    }

    res <- rep("",n)
    res[seq.int(i)] <- get_01str(mysamp[seq.int(i)],max_len)
    res
}

The oklens part integrates the OP's idea for omitting strings that are guaranteed to make the sampling impossible. However, even doing this, we may follow a sampling path that leaves us with no more options. Taking the OP's example of max_len=4 and n=10, we know we must drop 0 and 1 from consideration, but what happens if our first four draws are 00, 01, 11 and 10? Oh well, I guess we are out of luck. This is why you should actually define sampling probabilities. (The OP has another idea, for determining, at each step, which nodes will lead to an impossible state, but that seems a tall order.)

Illustration

# how the indices line up

n_pool = sum(2^(1:max_len)) 
pdt <- data.table(id=1:n_pool)
pdt[,g:=findInterval(id,cuts)]
pdt[,gid:=1:.N,by=g]
pdt[,s:=get_01str(id,max_len)]

# example run

set.seed(4); drawem(5,5)
# [1] "01100" "1"     "0001"  "0101"  "00101"

set.seed(4); drawem(8,4)
# [1] "1100" "0"    "111"  "101"  "1101" "100"  ""     ""  

Benchmarks (older than those in @BrodieG's answer)

require(rbenchmark)
max_len = 8
n = 8

benchmark(
      jos_lp     = {
        pool <- unlist(lapply(seq_len(max_len),
          function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
        sample.lp(pool, n)},
      bro_string = {pool <- make_pool(max_len);sample01(pool,n)},
      fra_num    = drawem(n,max_len),
      replications=5)[1:5]
#         test replications elapsed relative user.self
# 2 bro_string            5    0.05      2.5      0.05
# 3    fra_num            5    0.02      1.0      0.02
# 1     jos_lp            5    1.56     78.0      1.55

n = 12
max_len = 12
benchmark(
  bro_string={pool <- make_pool(max_len);sample01(pool,n)},
  fra_num=drawem(n,max_len),
  replications=5)[1:5]
#         test replications elapsed relative user.self
# 1 bro_string            5    0.54     6.75      0.51
# 2    fra_num            5    0.08     1.00      0.08

Other answers. There are two other answers:

jos_enum = {pool <- unlist(lapply(seq_len(max_len), 
    function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
  get.template(pool, n)}
bro_num  = sample011(max_len,n)    

I left out @josilber's enumeration method because it took far too long; and @BrodieG's numeric/index method because it didn't work at the time, but now does. See @BrodieG's updated answer for more benchmarking.

Speed vs correctness. While @josilber's answers are far slower (and for the enumeration method, apparently much more memory-intensive), they are guaranteed to draw a sample of size n on the first try. With @BrodieG's string method or this answer, you'll have to resample again and again in the hope of drawing a full n. With large max_len, that shouldn't be such a problem, I guess.

This answer scales better than bro_string because it doesn't require the construction of pool up front.

0
josliber On

One approach would be to simply generate all possible tuples of the appropriate size with an iterative approach:

  1. Build all tuples of size 1 (all elements in pool)
  2. Take a cross product with elements in pool
  3. Remove any tuple using the same element of pool more than once
  4. Remove any exact duplicate of another tuple
  5. Remove any tuple with a pair that can't be used together
  6. Rinse and repeat until you get the appropriate tuple size

This is runnable for the sizes given (pool of length 30, max_len 4):

get.template <- function(pool, max_len) {
  banned <- which(outer(paste0('^', pool), pool, Vectorize(grepl)), arr.ind=T)
  banned <- banned[banned[,1] != banned[,2],]
  banned <- paste(banned[,1], banned[,2])
  vals <- matrix(seq(length(pool)))
  for (k in 2:max_len) {
    vals <- cbind(vals[rep(1:nrow(vals), each=length(pool)),],
                  rep(1:length(pool), nrow(vals)))
    # Can't sample same value more than once
    vals <- vals[apply(vals, 1, function(x) length(unique(x)) == length(x)),]
    # Sort rows to ensure unique only
    vals <- t(apply(vals, 1, sort))
    vals <- unique(vals)
    # Can't have banned pair
    combos <- combn(ncol(vals), 2)
    for (k in seq(ncol(combos))) {
        c1 <- combos[1,k]
        c2 <- combos[2,k]
        vals <- vals[!paste(vals[,c1], vals[,c2]) %in% banned,]
    }
  }
  return(matrix(pool[vals], nrow=nrow(vals)))
}

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
system.time(template <- get.template(pool, 4))
#   user  system elapsed 
#  4.549   0.050   4.614 

Now you can sample as many times as desired from the rows of template (which will be very fast), and this will be the same as randomly sampling from the space defined.

5
swenzel On

It is in python and not in r but jbaums said it would be okay.

So here is my contribution, see comments in the source for explanation of the crucial parts.
I'm still working on the analytical solution to determine the amount of possible combinations c for a tree of depth t and S samples, so I can improve the function combs. Maybe someone has it? This is really the bottleneck now.

Sampling 100 nodes from a tree with depth 16 takes around 8ms on my laptop. Not the first time, but it gets faster up to a certain point the more you sample because combBuffer gets filled.

import random


class Tree(object):
    """
    :param level: The distance of this node from the root.
    :type level: int
    :param parent: This trees parent node
    :type parent: Tree
    :param isleft: Determines if this is a left or a right child node. Can be
                   omitted if this is the root node.
    :type isleft: bool

    A binary tree representing possible strings which match r'[01]{1,n}'. Its
    purpose is to be able to sample n of its nodes where none of the sampled
    nodes' ids is a prefix for another one.
    It is possible to change Tree.maxdepth and then reuse the root. All
    children are created ON DEMAND, which means everything is lazily evaluated.
    If the Tree gets too big anyway, you can call 'prune' on any node to delete
    its children.

        >>> t = Tree()
        >>> t.sample(8, toString=True, depth=3)
        ['111', '110', '101', '100', '011', '010', '001', '000']
        >>> Tree.maxdepth = 2
        >>> t.sample(4, toString=True)
        ['11', '10', '01', '00']
    """

    maxdepth = 10
    _combBuffer = {}

    def __init__(self, level=0, parent=None, isleft=None):
        self.parent = parent
        self.level = level
        self.isleft = isleft
        self._left = None
        self._right = None

    @classmethod
    def setMaxdepth(cls, depth):
        """
        :param depth: The new depth
        :type depth: int

        Sets the maxdepth of the Trees. This basically is the depth of the root
        node.
        """
        if cls.maxdepth == depth:
            return

        cls.maxdepth = depth

    @property
    def left(self):
        """This tree's left child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._left is None:
            self._left = Tree(self.level+1, self, True)
        return self._left

    @property
    def right(self):
        """This tree's right child, 'None' if this is a leave node"""
        if self.depth == 0:
            return None

        if self._right is None:
            self._right = Tree(self.level+1, self, False)
        return self._right

    @property
    def depth(self):
        """
        This tree's depth. (maxdepth-level)
        """
        return self.maxdepth-self.level

    @property
    def id(self):
        """
        This tree's id, string of '0's and '1's equal to the path from the root
        to this subtree. Where '1' means going left and '0' means going right.
        """
        # level 0 is the root node, it has no id
        if self.level == 0:
            return ''
        # This takes at most Tree.maxdepth recursions. Therefore
        # it is save to do it this way. We could also save each nodes
        # id once it is created to avoid recreating it every time, however
        # this won't save much time but use quite some space.
        return self.parent.id + ('1' if self.isleft else '0')

    @property
    def leaves(self):
        """
        The amount of leave nodes, this tree has. (2**depth)
        """
        return 2**self.depth

    def __str__(self):
        return self.id

    def __len__(self):
        return 2*self.leaves-1

    def prune(self):
        """
        Recursively prune this tree's children.
        """
        if self._left is not None:
            self._left.prune()
            self._left.parent = None
            self._left = None

        if self._right is not None:
            self._right.prune()
            self._right.parent = None
            self._right = None

    def combs(self, n):
        """
        :param n: The amount of samples to be taken from this tree
        :type n: int
        :returns: The amount of possible combinations to choose n samples from
                  this tree

        Determines recursively the amount of combinations of n nodes to be
        sampled from this tree.
        Subsequent calls with same n on trees with same depth will return the
        result from the previous computation rather than computing it again.

            >>> t = Tree()
            >>> Tree.maxdepth = 4
            >>> t.combs(16)
            1
            >>> Tree.maxdepth = 3
            >>> t.combs(6)
            58
        """

        # important for the amount of combinations is only n and the depth of
        # this tree
        key = (self.depth, n)

        # We use the dict to save computation time. Calling the function with
        # equal values on equal nodes just returns the alrady computed value if
        # possible.
        if key not in Tree._combBuffer:
            leaves = self.leaves

            if n < 0:
                N = 0
            elif n == 0 or self.depth == 0 or n == leaves:
                N = 1
            elif n == 1:
                return (2*leaves-1)
            else:
                if n > leaves/2:
                    # if n > leaves/2, at least n-leaves/2 have to stay on
                    # either side, otherweise the other one would have to
                    # sample more nodes than possible.
                    nMin = n-leaves/2
                else:
                    nMin = 0

                # The rest n-2*nMin is the amount of samples that are free to
                # fall on either side
                free = n-2*nMin

                N = 0
                # sum up the combinations of all possible splits
                for addLeft in range(0, free+1):
                    nLeft = nMin + addLeft
                    nRight = n - nLeft
                    N += self.left.combs(nLeft)*self.right.combs(nRight)

            Tree._combBuffer[key] = N
            return N
        return Tree._combBuffer[key]

    def sample(self, n, toString=False, depth=None):
        """
        :param n: How may samples to take from this tree
        :type n: int
        :param toString: If 'True' result will direclty be turned into a list
                         of strings
        :type toString: bool
        :param depth: If not None, will overwrite Tree.maxdepth
        :type depth: int or None
        :returns: List of n nodes sampled from this tree
        :throws ValueError: when n is invalid

        Takes n random samples from this tree where none of the sample's ids is
        a prefix for another one's.

        For an example see Tree's docstring.
        """
        if depth is not None:
            Tree.setMaxdepth(depth)

        if toString:
            return [str(e) for e in self.sample(n)]

        if n < 0:
            raise ValueError('Negative sample size is not possible!')

        if n == 0:
            return []

        leaves = self.leaves
        if n > leaves:
            raise ValueError(('Cannot sample {} nodes, with only {} ' +
                              'leaves!').format(n, leaves))

        # Only one sample to choose, that is nice! We are free to take any node
        # from this tree, including this very node itself.
        if n == 1 and self.level > 0:
            # This tree has 2*leaves-1 nodes, therefore
            # the probability that we keep the root node has to be
            # 1/(2*leaves-1) = P_root. Lets create a random number from the
            # interval [0, 2*leaves-1).
            # It will be 0 with probability 1/(2*leaves-1)
            P_root = random.randint(0, len(self)-1)
            if P_root == 0:
                return [self]
            else:
                # The probability to land here is 1-P_root

                # A child tree's size is (leaves-1) and since it obeys the same
                # rule as above, the probability for each of its nodes to
                # 'survive' is 1/(leaves-1) = P_child.
                # However all nodes must have equal probability, therefore to
                # make sure that their probability is also P_root we multiply
                # them by 1/2*(1-P_root). The latter is already done, the
                # former will be achieved by the next condition.
                # If we do everything right, this should hold:
                # 1/2 * (1-P_root) * P_child = P_root

                # Lets see...
                # 1/2 * (1-1/(2*leaves-1)) * (1/leaves-1)
                # (1-1/(2*leaves-1)) * (1/(2*(leaves-1)))
                # (1-1/(2*leaves-1)) * (1/(2*leaves-2))
                # (1/(2*leaves-2)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-1)/((2*leaves-2) * (2*leaves-1)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-2)/((2*leaves-2) * (2*leaves-1))
                # 1/(2*leaves-1)
                # There we go!
                if random.random() < 0.5:
                    return self.right.sample(1)
                else:
                    return self.left.sample(1)

        # Now comes the tricky part... n > 1 therefore we are NOT going to
        # sample this node. Its probability to be chosen is 0!
        # It HAS to be 0 since we are definitely sampling from one of its
        # children which means that this node will be blocked by those samples.
        # The difficult part now is to prove that the sampling the way we do it
        # is really random.

        if n > leaves/2:
            # if n > leaves/2, at least n-leaves/2 have to stay on either
            # side, otherweise the other one would have to sample more
            # nodes than possible.
            nMin = n-leaves/2
        else:
            nMin = 0
        # The rest n-2*nMin is the amount of samples that are free to fall
        # on either side
        free = n-2*nMin

        # Let's have a look at an example, suppose we were to distribute 5
        # samples among two children which have 4 leaves each.
        # Each child has to get at least 1 sample, so the free samples are 3.
        # There are 4 different ways to split the samples among the
        # children (left, right):
        # (1, 4), (2, 3), (3, 2), (4, 1)
        # The amount of unique sample combinations per child are
        # (7, 1), (11, 6), (6, 11), (1, 7)
        # The amount of total unique samples per possible split are
        #   7   ,   66  ,   66  ,    7
        # In case of the first and last split, all samples have a probability
        # of 1/7, this was already proven above.
        # Lets suppose we are good to go and the per sample probabilities for
        # the other two cases are (1/11, 1/6) and (1/6, 1/11), this way the
        # overall per sample probabilities for the splits would be:
        #  1/7  ,  1/66 , 1/66 , 1/7
        # If we used uniform random to determine the split, all splits would be
        # equally probable and therefore be multiplied with the same value (1/4)
        # But this would mean that NOT every sample is equally probable!
        # We need to know in advance how many sample combinations there will be
        # for a given split in order to find out the probability to choose it.
        # In fact, due to the restrictions, this becomes very nasty to
        # determine. So instead of solving it analytically, I do it numerically
        # with the method 'combs'. It gives me the amount of possible sample
        # combinations for a certain amount of samples and a given tree depth.
        # It will return 146 for this node and 7 for the outer and 66 for the
        # inner splits.
        # What we now do is, we take a number from [0, 146).
        # if it is smaller than 7, we sample from the first split,
        # if it is smaller than 7+66, we sample from the second split,
        # ...
        # This way we get the probabilities we need.

        r = random.randint(0, self.combs(n)-1)
        p = 0
        for addLeft in xrange(0, free+1):
            nLeft = nMin + addLeft
            nRight = n - nLeft

            p += (self.left.combs(nLeft) * self.right.combs(nRight))
            if r < p:
                return self.left.sample(nLeft) + self.right.sample(nRight)
        assert False, ('Something really strange happend, p did not sum up ' +
                       'to combs or r was too big')


def main():
    """
    Do a microbenchmark.
    """
    import timeit
    i = 1
    main.t = Tree()
    template = ' {:>2}  {:>5} {:>4}  {:<5}'
    print(template.format('i', 'depth', 'n', 'time (ms)'))
    N = 100
    for depth in [4, 8, 15, 16, 17, 18]:
        for n in [10, 50, 100, 150]:
            if n > 2**depth:
                time = '--'
            else:
                time = timeit.timeit(
                    'main.t.sample({}, depth={})'.format(n, depth), setup=
                    'from __main__ import main', number=N)*1000./N
            print(template.format(i, depth, n, time))
            i += 1


if __name__ == "__main__":
    main()

The benchmark output:

  i  depth    n  time (ms)
  1      4   10  0.182511806488
  2      4   50  --   
  3      4  100  --   
  4      4  150  --   
  5      8   10  0.397620201111
  6      8   50  1.66054964066
  7      8  100  2.90236949921
  8      8  150  3.48146915436
  9     15   10  0.804011821747
 10     15   50  3.7428188324
 11     15  100  7.34910964966
 12     15  150  10.8230614662
 13     16   10  0.804491043091
 14     16   50  3.66818904877
 15     16  100  7.09567070007
 16     16  150  10.404779911
 17     17   10  0.865840911865
 18     17   50  3.9999294281
 19     17  100  7.70257949829
 20     17  150  11.3758206367
 21     18   10  0.915451049805
 22     18   50  4.22935962677
 23     18  100  8.22361946106
 24     18  150  12.2081303596

10 samples of size 10 from a tree with depth 10:

['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010']
['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111']
['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101']
['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110']
['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110']
['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101']
['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111']
['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111']
['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100']
['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000']
0
Tensibai On

I found the problem interesting so I tried and get this with really low skills in R (so this may be improved):

Newer edited version, thanks to @Franck advice:

library(microbenchmark)
library(lineprof)

max_len <- 16
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
n<-100

library(stringr)
tree_sample <- function(samples,pool) {
  results <- vector("integer",samples)
  # Will be used on a regular basis, compute it in advance
  PoolLen <- str_length(pool)
  # Make a mask vector based on the length of each entry of the pool
  masks <- strtoi(str_pad(str_pad("1",PoolLen,"right","1"),max_len,"right","0"),base=2)

  # Make an integer vector from "0" right padded orignal: for max_len=4 and pool entry "1" we get "1000" => 8
  # This will allow to find this entry as parent of 10 and 11 which become "1000" and "1100", as integer 8 and 12 respectively
  # once bitwise "anded" with the repective mask "1000" the first bit is striclty the same, so it's a parent.
  integerPool <- strtoi(str_pad(pool,max_len,"right","0"),base=2)

  # Create a vector to filter the available value to sample
  ok <- rep(TRUE,length(pool))

  #Precompute the result of the bitwise and betwwen our integer pool and the masks   
  MaskedPool <- bitwAnd(integerPool,masks)

  while(samples) {
    samp <- sample(pool[ok],1) # Get a sample
    results[samples] <- samp # Store it as result
    ok[pool == samp] <- FALSE # Remove it from available entries

    vsamp <- strtoi(str_pad(samp,max_len,"right","0"),base=2) # Get the integer value of the "0" right padded sample
    mlen <- str_length(samp) # Get sample len

    #Creation of unitary mask to remove childs of sample
    mask <- strtoi(paste0(rep(1:0,c(mlen,max_len-mlen)),collapse=""),base=2)

    # Get the result of bitwise And between the integerPool and the sample mask 
    FilterVec <- bitwAnd(integerPool,mask)

    # Get the bitwise and result of the sample and it's mask
    Childm <- bitwAnd(vsamp,mask)

    ok[FilterVec == Childm] <- FALSE  # Remove from available entries the childs of the sample
    ok[MaskedPool == bitwAnd(vsamp,masks)] <- FALSE # compare the sample with all the masks to remove parents matching

    samples <- samples -1
  }
  print(results)
}
microbenchmark(tree_sample(n,pool),times=10L)

The main idea is to use the bitmask comparison to know if one sample is parent of the other (common bit part), if yes suppress this element from the pool.

It takes now 1,4 s to draw 100 samples from a pool of length 16 on my machine.

0
Jim Wood On

Intro

I found this question so interesting that I felt compelled to think it over, and ultimately provide my own answer. Since the algorithm I arrived at does not immediately follow from the problem description, I will start by explaining how I arrived at this solution, and then provide a sample implementation in C++ (I have never written R).

Development of the Solution

At First Glance

Reading the problem description was initially confusing, but as soon as I saw the edit with the pictures of the trees, I immediately understood the problem and my intuition suggested that a binary tree was also a solution: build a tree (a collection of trees of size 1), and break the tree into a collection of smaller trees after eliminating the branches and ancestors upon making a selection.

Although this initially seemed good, the selection process and maintenance of the collection would be a pain. Still, the tree seemed like it should play a big part in any solution.

Revision 1

Don't break up the tree. Instead have a Boolean data payload at each node indicating if it had already been eliminated. This leaves only one tree that maintains form.

Notice however, that this is not just any binary tree, it is actually a complete binary tree of depth max_len-1.

Revision 2

A complete binary tree can be represented very nicely as an array. The typical array representation used the breadth-first search of the tree, has the following properties:

Let x be the array index.
x = 0 is the root of the entire tree
left_child(x) = 2x + 1
right_child(x) = 2x + 2
parent(x) = floor((n-1)/2)

In the illustration below each node is labelled with its array index: breadth-first

As an array, this takes up less memory (no more pointers), the memory used is contiguous (good for caching), and can go entirely on the stack rather than the heap (assuming your language gives you a choice). Of course some conditions apply here, in particular the size of the array. I'll come back to that later.

Just as in Revision 1, the data stored in the array will be boolean: true for available, false for unavailable. Since the root node is not actually a valid choice, index 0 should be initialized to false. The problem of how to make a selection still remains:

As indicies are elimitated, it is trivial to keep track of how many are eliminated, and thus how many remain. Choose a random number in that range, then walk the array until having seen that many indicies set to true (the current index included). The index arrived at is then the selection to make. Select until n indicies are selected, or there is nothing left to select.

This is a complete algorithm that will work, however there is room for improvement in the selection process, and there is also the practical issue of size that hasn't been addressed: the array size in would be O(2^n). As n gets larger, first the caching benefit disappears, then the data starts being paged to disk, and at some point it becomes impossible to store at all.

Revision 3

I decided to tackle the easier problem first: improving the selection process.

Scanning the array left-to-right is wasteful. It may be more efficient to track the ranges which have been eliminated rather than checking and finding several falses in a row. However our tree representation is not ideal for this, since few of the nodes which would be eliminated each round would be contiguous in the array.

By rearranging how the array maps to the tree, it is possible however to better exploit this. In particular, let us use pre-order depth-first search rather than the breadth-first search. In order to do so, the tree needs to be fixed in size, which is the case for this problem. It's also less obvious how the indices of child and parent nodes are connected mathematically.

depth-first

By using this arrangement, every choice that is not a leaf is guaranteed to eliminate a contiguous range: its subtree.

Revision 4

By tracking the eliminated ranges, there is no longer a need for the true/false data, and thus no need for the array or tree at all. At each random draw, the eliminated ranges can be used to quickly find the node to select. All ancestors and the entire subtree are eliminated, and can be represented as ranges which can easily be merged with the others.

The final task is to turn the selected nodes into the string representation the OP wanted. That is quite easy as this binary tree still maintains a strict order: traversing from the root, all elements >= the right child are on the right, and the others are on the left. Thus searching the tree will provide both the list of ancestors and the binary string by appending '0' when traversing left; or a '1' when traversing right.

Sample Implementation

#include <stdint.h>
#include <algorithm>
#include <cmath>
#include <list>
#include <deque>
#include <ctime>
#include <cstdlib>
#include <iostream>

/*
 * A range of values of the form (a, b), where a <= b, and is inclusive.
 * Ex (1,1) is the range from 1 to 1 (ie: just 1)
 */
class Range
{
private:
    friend bool operator< (const Range& lhs, const Range& rhs);
    friend std::ostream& operator<<(std::ostream& os, const Range& obj);

    int64_t m_start;
    int64_t m_end;

public:
    Range(int64_t start, int64_t end) : m_start(start), m_end(end) {}
    int64_t getStart() const { return m_start; }
    int64_t getEnd() const { return m_end; }
    int64_t size() const { return m_end - m_start + 1; }
    bool canMerge(const Range& other) const {
        return !((other.m_start > m_end + 1) || (m_start > other.m_end + 1));
    }
    int64_t merge(const Range& other) {
        int64_t change = 0;
        if (m_start > other.m_start) {
            change += m_start - other.m_start;
            m_start = other.m_start;
        }
        if (other.m_end > m_end) {
            change += other.m_end - m_end;
            m_end = other.m_end;
        }
        return change;
    }
};

inline bool operator< (const Range& lhs, const Range& rhs){return lhs.m_start < rhs.m_start;}
std::ostream& operator<<(std::ostream& os, const Range& obj) {
    os << '(' << obj.m_start << ',' << obj.m_end << ')';
    return os;
}

/*
 * Stuct to allow returning of multiple values
 */
struct NodeInfo {
    int64_t subTreeSize;
    int64_t depth;
    std::list<int64_t> ancestors;
    std::string representation;
};

/*
 * Collection of functions representing a complete binary tree
 * as an array created using pre-order depth-first search,
 * with 0 as the root.
 * Depth of the root is defined as 0.
 */
class Tree
{
private:
    int64_t m_depth;
public:
    Tree(int64_t depth) : m_depth(depth) {}
    int64_t size() const {
        return (int64_t(1) << (m_depth+1))-1;
    }
    int64_t getDepthOf(int64_t node) const{
        if (node == 0) { return 0; }
        int64_t searchDepth = m_depth;
        int64_t currentDepth = 1;
        while (true) {
            int64_t rightChild = int64_t(1) << searchDepth;
            if (node == 1 || node == rightChild) {
                break;
            } else if (node > rightChild) {
                node -= rightChild;
            } else {
                node -= 1;
            }
            currentDepth += 1;
            searchDepth -= 1;
        }
        return currentDepth;
    }
    int64_t getSubtreeSizeOf(int64_t node, int64_t nodeDepth = -1) const {
        if (node == 0) {
            return size();
        }
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        return (int64_t(1) << (m_depth + 1 - nodeDepth)) - 1;
    }
    int64_t getLeftChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1;
    }
    int64_t getRightChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) {
            nodeDepth = getDepthOf(node);
        }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1 + ((getSubtreeSizeOf(node, nodeDepth) - 1) / 2);
    }
    NodeInfo getNodeInfo(int64_t node) const {
        NodeInfo info;
        int64_t depth = 0;
        int64_t currentNode = 0;
        while (currentNode != node) {
            if (currentNode != 0) {
                info.ancestors.push_back(currentNode);
            }
            int64_t rightChild = getRightChildOf(currentNode, depth);
            if (rightChild == -1) {
                break;
            } else if (node >= rightChild) {
                info.representation += '1';
                currentNode = rightChild;
            } else {
                info.representation += '0';
                currentNode = getLeftChildOf(currentNode, depth);
            }
            depth++;
        }
        info.depth = depth;
        info.subTreeSize = getSubtreeSizeOf(node, depth);
        return info;
    }
};

// random selection amongst remaining allowed nodes
int64_t selectNode(const std::deque<Range>& eliminationList, int64_t poolSize, std::mt19937_64& randomGenerator)
{
    std::uniform_int_distribution<> randomDistribution(1, poolSize);
    int64_t selection = randomDistribution(randomGenerator);
    for (auto const& range : eliminationList) {
        if (selection >= range.getStart()) { selection += range.size(); }
        else { break; }
    }
    return selection;
}

// determin how many nodes have been elimintated
int64_t countEliminated(const std::deque<Range>& eliminationList)
{
    int64_t count = 0;
    for (auto const& range : eliminationList) {
        count += range.size();
    }
    return count;
}

// merge all the elimination ranges to listA, and return the number of new elimintations
int64_t mergeEliminations(std::deque<Range>& listA, std::deque<Range>& listB) {
    if(listB.empty()) { return 0; }
    if(listA.empty()) {
        listA.swap(listB);
        return countEliminated(listA);
    }

    int64_t newEliminations = 0;
    int64_t x = 0;
    auto listA_iter = listA.begin();
    auto listB_iter = listB.begin();
    while (listB_iter != listB.end()) {
        if (listA_iter == listA.end()) {
            listA_iter = listA.insert(listA_iter, *listB_iter);
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (listA_iter->canMerge(*listB_iter)) {
            x = listA_iter->merge(*listB_iter);
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (*listB_iter < *listA_iter) {
            listA_iter = listA.insert(listA_iter, *listB_iter) + 1;
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }
    while (listA_iter != listA.end()) {
        if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }
    return newEliminations;
}

int main (int argc, char** argv)
{
    std::random_device rd;
    std::mt19937_64 randomGenerator(rd());

    int64_t max_len = std::stoll(argv[1]);
    int64_t num_samples = std::stoll(argv[2]);

    int64_t samplesRemaining = num_samples;
    Tree tree(max_len);
    int64_t poolSize = tree.size() - 1;
    std::deque<Range> eliminationList;
    std::deque<Range> eliminated;
    std::list<std::string> foundList;

    while (samplesRemaining > 0 && poolSize > 0) {
        // find a valid node
        int64_t selectedNode = selectNode(eliminationList, poolSize, randomGenerator);
        NodeInfo info = tree.getNodeInfo(selectedNode);
        foundList.push_back(info.representation);
        samplesRemaining--;

        // determine which nodes this choice eliminates
        eliminated.clear();
        for( auto const& ancestor : info.ancestors) {
            Range r(ancestor, ancestor);
            if(eliminated.empty() || !eliminated.back().canMerge(r)) {
                eliminated.push_back(r);
            } else {
                eliminated.back().merge(r);
            }
        }
        Range r(selectedNode, selectedNode + info.subTreeSize - 1);
        if(eliminated.empty() || !eliminated.back().canMerge(r)) {
            eliminated.push_back(r);
        } else {
            eliminated.back().merge(r);
        }

        // add the eliminated nodes to the existing list
        poolSize -= mergeEliminations(eliminationList, eliminated);
    }

    // Print some stats
    // std::cout << "tree: " << tree.size() << " samplesRemaining: "
    //                       << samplesRemaining << " poolSize: "
    //                       << poolSize << " samples: " << foundList.size()
    //                       << " eliminated: "
    //                       << countEliminated(eliminationList) << std::endl;

    // Print list of binary strings
    // std::cout << "list:";
    // for (auto const& s : foundList) {
    //  std::cout << " " << s;
    // }
    // std::cout << std::endl;
}

Additional Thoughts

This algorithm will scale extremely well for max_len. Scaling with n is not very good, but based on my own profiling it seems to do better than the other solutions.

This algorithm could be modified with little effort to allow for strings containing more than just '0' and '1'. More possible letters in the words increases the fan-out of the tree, and would eliminate a wider range with each selection - still with all nodes in each subtree remaining contiguous.