Generate vector of 'random' proportions of a given length within specific boundaries

I want to generate a vector of a given length, e.g., n = 5. Each value in the vector should be a proportion (i.e., a value between 0 and 1) so that across n elements they sum up to 1.

However, each proportion is also constrained by two vectors: one (mymins) defines the allowed lower boundary of each proportion and the other (mymaxs) defines the allowed upper boundary of each proportion.

In my example below the desired proportion for the first element is allowed to fall anywhere between 0.3 and 0.9. And for the last element, the desired proportion is allowed to fall between 0.05 and 0.7.

mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)

Let's assume that mymins are always 'legitimate' (i.e., their sum is never larger than 1).

How could I find a set of 5 proportions such that they all sum to 1 but lie within the boundaries?

Here is what I tried:

n = 5
mydif <- mymaxs - mymins    # possible range for each proportion
myorder <- rank(mydif)      # order those differences from smallest to largest
mytarget <- sum(mydif)      # sum up the 5 ranges
x <- sort(runif(n))[myorder] # generate 5 random values and sort them in the order of mydif
x2 <- mymins + x / sum(x) * mytarget  # rescale random values to sum up to mytarget and add them to mymins
x3 <- x2/sum(x2)             # rescale x2 to sum up to 1

As you can see, I did not get very far, because after rescaling some values are outside of their allowed boundaries.

I should probably also mention that I need this operation to be fast - because I am using it in an optimization loop.

I also tried to find a solution using optim. However, the problem is that it always finds the same solution, and I need to generate a DIFFERENT solution every time I generate the proportions:

    myfun <- function(x) {
      x <- round(x, 4)
      abovemins <- x - mymins
      n_belowmins <- sum(abovemins < 0)
      if (n_belowmins > 0) return(100000)
      belowmax <- x - mymaxs
      n_abovemax <- sum(belowmax > 0)
      if (n_abovemax > 0) return(100000)
      mydist <- abs(sum(x) - 1)
      return(mydist)
    }

    myopt <- optim(par = mymins + 0.01, fn = myfun)
    myopt$par
    sum(round(myopt$par, 4))

Thank you very much for your suggestions!

There are 3 answers

Answer by r2evans (best answer)

Because you need 5 random numbers to sum to 1, you really only have 4 independent numbers and one dependent number.

mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)

set.seed(42)
iter <- 1000
while(iter > 0 &&
        (
          (1 - sum(x <- runif(4, mymins[-5], mymaxs[-5]))) < mymins[5] ||
            (1 - sum(x)) > mymaxs[5]
        )
      ) iter <- iter - 1
if (iter < 1) {
  # failed
  stop("unable to find something within 1000 iterations")
} else {
  x <- c(x, 1-sum(x))
}

sum(x)
# [1] 1
all(mymins <= x & x <= mymaxs)
# [1] TRUE
x
# [1] 0.37732330 0.21618036 0.07225311 0.24250359 0.09173965

The reason I use iter there is to make sure you don't take an "infinite" amount of time to find something. If your mymins and mymaxs combination makes this mathematically infeasible (as your first example was), then you don't need to spin forever. If it is feasible but unlikely to be found in a reasonable amount of time, you need to weigh how long you are willing to keep trying.

One reason this takes so long is that we are iteratively pulling entropy. If you expect this to go for a long time, then it is generally better to pre-calculate as much as you think you'll need (overall) and run things as a matrix.

set.seed(42)
n <- 10000
m <- matrix(runif(prod(n, length(mymins)-1)), nrow = n)
m <- t(t(m) * (mymaxs[-5] - mymins[-5]) + mymins[-5])
remainders <- (1 - rowSums(m))
ind <- mymins[5] <= remainders & remainders <= mymaxs[5]
table(ind)
# ind
# FALSE  TRUE 
#  9981    19 
m <- cbind(m[ind,,drop=FALSE], remainders[ind])
nrow(m)
# [1] 19
rowSums(m)
#  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
head(m)
#           [,1]      [,2]       [,3]      [,4]       [,5]
# [1,] 0.3405821 0.1306152 0.05931363 0.2199362 0.24955282
# [2,] 0.3601376 0.1367465 0.20235704 0.2477507 0.05300821
# [3,] 0.4469526 0.1279795 0.02265618 0.2881733 0.11423845
# [4,] 0.5450527 0.1029903 0.07503371 0.2052423 0.07168103
# [5,] 0.3161519 0.1469783 0.15290720 0.3268470 0.05711557
# [6,] 0.4782448 0.1185735 0.01664063 0.2178225 0.16871845
all(
  mymins[1] <= m[,1] & m[,1] <= mymaxs[1],
  mymins[2] <= m[,2] & m[,2] <= mymaxs[2],
  mymins[3] <= m[,3] & m[,3] <= mymaxs[3],
  mymins[4] <= m[,4] & m[,4] <= mymaxs[4],
  mymins[5] <= m[,5] & m[,5] <= mymaxs[5]
)
# [1] TRUE

This time it took 10000 attempts to make 19 valid combinations. It might take more or fewer attempts based on randomness, so ymmv with regards to how much you need to pre-generate.
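If you need a fixed number of valid rows rather than "whatever one batch yields", the matrix approach can be wrapped in a loop over batches. The following is only a sketch of that idea: the function name sample_batches and its arguments k, batch and max_batches are made up for illustration and are not part of the code above.

```r
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)

# Hypothetical helper: keep drawing batches of candidates until `k` rows
# satisfy the bounds, giving up after `max_batches` batches.
sample_batches <- function(k, mymins, mymaxs, batch = 10000, max_batches = 100) {
  p <- length(mymins)
  out <- matrix(numeric(0), nrow = 0, ncol = p)
  for (b in seq_len(max_batches)) {
    # first p - 1 coordinates: uniform within their own bounds
    m <- matrix(runif(batch * (p - 1)), nrow = batch)
    m <- t(t(m) * (mymaxs[-p] - mymins[-p]) + mymins[-p])
    # last coordinate: whatever is left so that the row sums to 1
    rem <- 1 - rowSums(m)
    ok <- mymins[p] <= rem & rem <= mymaxs[p]
    out <- rbind(out, cbind(m, rem)[ok, , drop = FALSE])
    if (nrow(out) >= k) return(unname(out[seq_len(k), , drop = FALSE]))
  }
  stop("could not collect ", k, " valid rows in ", max_batches, " batches")
}

set.seed(42)
res <- sample_batches(5, mymins, mymaxs)
rowSums(res)
```

The max_batches cap plays the same role as iter in the first snippet: it stops the search when the bounds make valid rows too rare (or impossible) to find.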

Answer by Allan Cameron

Perhaps it's better to think of this in a different way. Your samples actually need to sum to 0.35 (which is 1 - sum(mymins)), and are then added on to the minimum values:

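constrained_sample <- function(mymins, mymaxs)
{
 sizes <- mymaxs - mymins                      # width of each allowed range
 samp <- (runif(length(mymins)) * sizes)       # random share within each range
 samp/sum(samp) * (1 - sum(mymins)) + mymins   # rescale to the remaining total, then add the minimums
}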

It works like this:

constrained_sample(mymins, mymaxs)
#> [1] 0.31728333 0.17839397 0.07196067 0.29146744 0.14089459

We can test this works by running the following loop, which will print a message to the console if any of the criteria aren't met:

for(i in 1:1000)
{
  test <- constrained_sample(mymins, mymaxs)
  if(!all(test > mymins) || !all(test < mymaxs) || abs(sum(test) - 1) > 1e-6) cat("failure")
}

This prints no failure messages, since the criteria are always met for these bounds. However, as @GregorThomas points out, the bounds aren't realistic in this case. We can see a range of solutions constrained by your conditions using a boxplot:

library(ggplot2)

samp <- constrained_sample(mymins, mymaxs)
for(i in 1:999) samp <- rbind(samp, constrained_sample(mymins, mymaxs))
df <- data.frame(val = c(samp[,1], samp[,2], samp[,3], samp[,4], samp[,5]), 
                 index = factor(rep(1:5, each = 1000)))
ggplot(df, aes(x = index, y = val)) + geom_boxplot()

[Figure: boxplots of the sampled values (val) for each index 1 to 5]
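One thing worth checking before relying on the rescaling approach with other bounds: since samp/sum(samp) is at most 1, no element can end up above mymins + (1 - sum(mymins)). If that quantity never exceeds mymaxs, the upper bounds are respected automatically. The sketch below tests this sufficient condition; the helper name rescaling_is_safe is hypothetical and not part of the answer's code.

```r
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)

# Hypothetical helper: TRUE when rescaling can never push a value past mymaxs
rescaling_is_safe <- function(mymins, mymaxs) {
  slack <- 1 - sum(mymins)   # total amount left to distribute above the minimums
  slack >= 0 && all(mymins + slack <= mymaxs)
}

rescaling_is_safe(mymins, mymaxs)
# [1] TRUE
```

If this returns FALSE (for example with a first upper bound of 0.4 instead of 0.9), some draws may land above their maximum, and a rejection step like the one in r2evans's answer would be needed.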

Answer by Gregor Thomas

If your example bounds are realistic, we can refine them quite a bit, narrowing the range of possibilities. For the current version of the question with:

mymins = c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs = c(0.9, 1, 1, 1, 0.7)

What's the max for x[1]? Well, if x[2:5] take on minimum values, they will add up to 0.1 + 0 + 0.2 + 0.05 = 0.35, so based on the other mins only we know that max value for x[1] is 1 - 0.35 = 0.65. The 0.9 in mymaxs is way too high.

We can calculate the actual max values by taking, for each element, the smaller of its entry in mymaxs and the max implied by the other elements' minimums:

new_max = pmin(mymaxs, 1 - (sum(mymins) - mymins))
new_max
# [1] 0.65 0.45 0.35 0.55 0.40

We can similarly revise the min bounds, though in this case even the revised max bounds new_max are high enough that they don't have any impact on the minimums.

new_min = pmax(mymins, 1 - (sum(new_max) - new_max))
new_min
# [1] 0.30 0.10 0.00 0.20 0.05

With these adjustments, we should be able to see easily if any solutions are possible (all(new_min < new_max)). And then generating random numbers as in r2evans's answer should go much quicker using the new bounds.
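To get a feel for how much the tighter bounds help, here is a rough sketch that compares the share of accepted rows for the original and the revised bounds, using the same matrix-style rejection step as r2evans's answer. The function accept_rate and its arguments are invented for this illustration.

```r
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)

# Tighten the bounds as described above
new_max <- pmin(mymaxs, 1 - (sum(mymins) - mymins))
new_min <- pmax(mymins, 1 - (sum(new_max) - new_max))

# Hypothetical helper: fraction of uniform candidates whose last
# (dependent) element falls inside its own bounds
accept_rate <- function(lo, hi, n = 10000) {
  p <- length(lo)
  m <- matrix(runif(n * (p - 1)), nrow = n)
  m <- t(t(m) * (hi[-p] - lo[-p]) + lo[-p])
  rem <- 1 - rowSums(m)
  mean(lo[p] <= rem & rem <= hi[p])
}

set.seed(42)
accept_rate(mymins, mymaxs)    # original bounds
accept_rate(new_min, new_max)  # tightened bounds
```

Tightening only cuts away parts of the box that can never contain a valid combination, so the accepted rows come from the same region as before; there are just far fewer rejected candidates.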