SAS Proc IML Simulate from empirical data with limits

44 views Asked by At

This might sound bonkers, but looking to see if there are any ideas on how to do this.

I have N categories (say 7) where a set number of people (say 1000) have to be allocated. I know from historical data the minimum and maximum for each category (there is limited historical data, say 15 samples, so I have data that looks like this - if I had a larger sample, I would try to generate a distribution for each category from all the samples, but there isn't.

-Year 1: [78 97 300 358 132 35 0]
-Year 2: [24 74 346 300 148 84 22]
-.
-.
-Year 15:[25 85 382 302 146 52 8] 

The min and max for each category over these 15 years of data is:
Min:   [25  74  252 278 112 27 0 ]
Max:   [132 141 382 360 177 84 22]

I am trying to scale this using simulation - by allocating 1000 to each category within the min and max limits, and repeating it. The only condition is that the sum of the allocation across the seven categories in each simulation has to sum to 1000.

Any ideas would be greatly appreciated!

1

There are 1 answers

2
Rick On

The distribution you want is called the multinomial distribution. You can use the RandMultinomial function in SAS/IML to produce random samples from the multinomial distribution. To use the multinomial distribution, you need to know the probability of an individual in each category. If this probability has not changed over time, the best estimate of this probability is to take the average proportion in each category.
Thus, I would recommend using ALL the data to estimate the probability, not just max and min:

proc iml;
X = {...};       /* X is a 15 x 7 matrix of counts, each row is a year */
mean = mean(X);
p = mean / sum(mean);
/* simulate new counts by using the multinomial distribution */
numSamples = 10;
SampleSize = 1000;
Y = randmultinomial(numSamples, SampleSize, p);
print Y;

Now, if you insist on using the max/min, you could use the midrange to estimate the most likely value and use that to estimate the probabilty, as follows:

Min = {25  74  252 278 112 27 0};
Max = {132 141 382 360 177 84 22};
/* use midrange to estimate probabilities */
midrange = (Min + Max)/2;
p = midrange / sum(midrange);
/* now use RandMultinomial, as before */

If you use the second method, there is no guarantee that the simulated values will not exceed the Min/Max values, although in practice many of the samples will obey that criterion.

Personally, I advocate the first method, which uses the average count. Or you can use a time-weighted count, if you think recent observations are more relevant than observations from 15 years ago.