Generate population data with specific distribution in R

I have a distribution of ages in a population.

For instance, you can imagine something like this:

Ages <25: 15%

Ages 25-49: 40%

Ages 50-60: 20%

Ages >60: 25%

I don't have the mean and standard deviation for each stratum/age group in the data. I am trying to generate a sample population of 1000 individuals where the generated data matches the distribution of ages shown above.

There are 2 answers

josliber (best answer)

Let's put this data into a data frame with one row per age bucket, giving each bucket's lower bound, upper bound, and share of the population:

(dat <- data.frame(min=c(0, 25, 50, 60), max=c(25, 50, 60, 100), prop=c(0.15, 0.40, 0.20, 0.25)))
#   min max prop
# 1   0  25 0.15
# 2  25  50 0.40
# 3  50  60 0.20
# 4  60 100 0.25

We can sample 1000 row indices from this table, with replacement and weighted by prop, using the sample function:

set.seed(144)  # For reproducibility
rows <- sample(nrow(dat), 1000, replace=TRUE, prob=dat$prop)
table(rows)
# rows
#   1   2   3   4 
# 139 425 198 238 
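Because sample() draws at random, the bucket counts above only approximate the target shares. If the generated population should hit the proportions exactly, one option is to allocate fixed counts per bucket and then shuffle. This is a minimal sketch, not part of the original answer; the names counts and rows_exact are introduced here for illustration, and it assumes the rounded counts add up to the population size, which holds for these proportions.

```r
# Target proportions per age bucket (same values as dat$prop above)
prop <- c(0.15, 0.40, 0.20, 0.25)
n <- 1000

# Deterministic allocation: each bucket gets round(n * prop) individuals.
# Assumption: the rounded counts sum to n (true for these values).
counts <- round(n * prop)
stopifnot(sum(counts) == n)

# Build the bucket indices, then shuffle so the order is random
set.seed(144)  # For reproducibility
rows_exact <- sample(rep(seq_along(prop), times = counts))
table(rows_exact)
```

The resulting rows_exact vector can be used in place of rows when drawing ages within each bucket.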

To sample actual ages you will need to define a distribution over the ages represented by each row. A simple one would be uniformly distributed ages:

age <- round(dat$min[rows] + runif(1000) * (dat$max[rows] - dat$min[rows]))
table(age)
# age
#   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27 
#   2   5   5   3   7   7   9   6   7   6   1   7   7   5   5   6   2   4   6   7   4  11   8   2   3  10  11  13 
#  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55 
#  19  16  20  16  18  21  16  19  14  20  15  13  18  15  24  20  16  16  29  16  11  12  18  17  17  26  27  21 
#  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83 
#  17  26  11  13  20   3   8   9   6   4   3   3   5   4   3   3   5   8   3  13   5   6   4   7   9   9   6   4 
#  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
#   5   5   9   9   5   6   8   9   5   4   6   5   9   6   8   4   1 
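One thing to be aware of with round(): a boundary age such as 25, 50 or 60 can come from either neighbouring bucket, and the end points 0 and 100 receive roughly half the weight of the other ages. If that matters, a possible variant is to use floor() and treat the buckets as half-open intervals [min, max). That interval convention is an assumption of this sketch, not something stated in the question. The sketch also maps the ages back to buckets with cut() to compare observed and target shares.

```r
# Same bucket table as above; buckets are treated as half-open [min, max) here,
# which is an assumption of this sketch.
dat <- data.frame(min = c(0, 25, 50, 60),
                  max = c(25, 50, 60, 100),
                  prop = c(0.15, 0.40, 0.20, 0.25))

set.seed(144)  # For reproducibility
n <- 1000
rows <- sample(nrow(dat), n, replace = TRUE, prob = dat$prop)

# floor() maps a uniform draw on [min, max) to an integer age in min..(max - 1),
# so every generated age belongs to exactly one bucket
age_floor <- floor(dat$min[rows] + runif(n) * (dat$max[rows] - dat$min[rows]))

# Recover each age's bucket and compare observed shares with the targets
bucket <- cut(age_floor, breaks = c(dat$min, max(dat$max)), right = FALSE)
observed <- as.vector(prop.table(table(bucket)))
data.frame(bucket = levels(bucket), target = dat$prop, observed = observed)
```

With this convention the largest possible age is 99 rather than 100.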

Of course, if uniformly sampling the ages in each range is inappropriate in your application, then you would need to pick some other function to get ages from buckets.
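As an illustration of a non-uniform choice, the sketch below assigns a weight to every integer age and rescales the weights so that each bucket still sums to its target proportion. The shape used here (flat below 60, declining linearly from 60 onward) is entirely made up for the example, and the buckets are assumed to be half-open [min, max); real demographic weights would need to be substituted.

```r
dat <- data.frame(min = c(0, 25, 50, 60),
                  max = c(25, 50, 60, 100),
                  prop = c(0.15, 0.40, 0.20, 0.25))

# All integer ages, with buckets treated as half-open [min, max)
ages <- 0:99
bucket <- findInterval(ages, dat$min)  # bucket index 1..4 for each age

# Hypothetical within-bucket shape: flat below 60, linearly declining after.
# These weights are invented for illustration only.
shape <- ifelse(ages < 60, 1, 100 - ages)

# Rescale so the weights inside each bucket sum to that bucket's proportion
w <- dat$prop[bucket] * shape / ave(shape, bucket, FUN = sum)

set.seed(144)  # For reproducibility
age <- sample(ages, 1000, replace = TRUE, prob = w)

# Bucket shares in the sample should be close to dat$prop
prop.table(table(findInterval(age, dat$min)))
```

Sampling ages directly from a per-age weight vector keeps the bucket totals tied to prop while leaving the within-bucket shape free to change.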

oyeoyeoye

This doesn't reproduce the bucket proportions you were looking for, but it does enforce the cut-offs: it draws ages from a single normal distribution truncated to the range 0 to 100. Hope it helps!

# install.packages("truncnorm")  # Run once if the package is not installed
library(truncnorm)

set.seed(123)  # For reproducibility
pop <- 1000

# Normal distribution truncated to [0, 100]; set your own mean and sd
ages <- rtruncnorm(n=pop, a=0, b=100, mean=40, sd=25)

summary(ages)
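To see how far a single truncated normal is from the target shares, the bucket probabilities it implies can be computed directly from pnorm() in base R, without sampling. This is only a sketch: ptrunc is a helper defined here for illustration (it is not a function from the truncnorm package), and the bucket edges are the ones used in the other answer.

```r
# Parameters taken from the rtruncnorm() call above
a <- 0; b <- 100; mu <- 40; sigma <- 25

# CDF of a normal(mu, sigma) truncated to [a, b], evaluated at x.
# ptrunc is a hypothetical helper written for this example.
ptrunc <- function(x) {
  (pnorm(x, mu, sigma) - pnorm(a, mu, sigma)) /
    (pnorm(b, mu, sigma) - pnorm(a, mu, sigma))
}

# Bucket edges from the other answer: 0, 25, 50, 60, 100
edges <- c(0, 25, 50, 60, 100)
implied <- diff(ptrunc(edges))

# Side-by-side comparison of target and implied bucket shares
data.frame(bucket = c("0-25", "25-50", "50-60", "60-100"),
           target = c(0.15, 0.40, 0.20, 0.25),
           implied = round(implied, 3))
```

If the implied shares differ noticeably from the targets, no amount of sampling from this distribution will recover the requested proportions; the mean and sd would have to be tuned, or the bucket-based approach used instead.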