How to sample data based off the distribution of another dataset in R

5.9k views Asked by At

I would like to sample a large dataset based on the distribution of a smaller dataset in R. I have been searching for a solution for some time without success. I am relatively new in R so I apologize if this is straightforward. However, I have tried some solutions.

Here are some sample data. I'll call it observed and model:

# Set seed
set.seed(2)

# Create smaller observed data
Obs <- rnorm(1000, 5, 2.5)

# Create larger modeled data
set.seed(2)
Model <- rnorm(10000, 8, 1.5)

The distributions of the two datasets are as follows: enter image description here

Goal: I would like to sample the larger "model" dataset to match the smaller "observed". I understand that there are different data points involved so it won't be a direct match.

I have been reading up on the density() and sample() where I do the following:

# Obtain the density of the observed at the length of the model.
# Note: info on the sample() function stated the prob argument in the sample() function 
# must be the same length as what's being sampled. Thus, n=length(Model) below.

dens.obs <- density(Obs, n=length(Model))

# Sample the Model data the length(Obs) at the probability of density of the observed
set.seed(22)
SampleMod <- sample(Model, length(Obs), replace=FALSE, prob=dens.obs$y)

This gives me the new plot that looks very similar to the old (except for the tails): enter image description here

I was hoping for a better match. Therefore I started explored using the density function on the model data. See below:

# Density function on model, length of model
dens.mod <- density(Model, n=length(Model))

# Sample the density of the model $x at the density of the observed $ y
set.seed(22)
SampleMod3 <- sample(dens.mod$x, length(Obs), replace=FALSE, prob=dens.obs$y)

Here are two plots, the first is the same as the first sampled and the second is the second sampled: enter image description here

There is a more desirable shift in the right plot, which represents the sampled density of the modeled by the density of the observed. However, the data are not the same. That is, I did NOT sample the Modeled data. See below:

summary(SampleMod3 %in% Model)

produces:

   Mode   FALSE    NA's 
logical    1000       0 

Indicating that I did not sample the modeled data, but rather the density of the modeled data. Is it possible to sample a dataset based on the distribution of another dataset? Thank you in advance.

EDIT:

Thanks for all the help guys! Here is my plot using approxfun() function offered from danielson and supported by bethanyp.

enter image description here

Any help with understanding why the funky new distribution?

2

There are 2 answers

2
danielson On

Interesting question. I think this will help. First, it approximates the density function. Then, it samples from the Model points with the fitted density's probabilities.

predict_density = approxfun(dens.obs) #function that approximates dens.obs
#sample points from Model with probability distr. of dens.obs
SampleMod3 <- sample(Model, length(Obs), replace=FALSE, prob=predict_density(Model))
summary(SampleMod3 %in% Model)
   Mode    TRUE    NA's 
logical    1000       0 
0
sconfluentus On

I assume that in practice you are using a real set of non-randomly generated data. In which case the likely values of the different samples have a probability of coming up because random sampling method does not mean no pattern in the data. In the wilderness real things have real frequencies, which will show in your meta-sample.

So you should use the weighted probabilities in selecting your smaller sub-sample from the original.

Example the whole population {1,2,1,3,4,1,3} where probabilities for each number being drawn (remember the sum must equal 1): 1 : .4285 2 :.1429 3: .2857 4: .1429

if you use these weighted probabilities in the prob= my_freqs part of

sample(x, size, replace = FALSE, prob = my_freqs)

You will likely obtain a probability more inline with what you were expecting. But I am not 100% sure if this is what you are after.

In the random data, try set.seed(2) and see if telling R to use the seed used to generate those frequencies in the original set creation gets you closer to your goal.

I know that there is a universal random formula associated with each set. I would have to assume it is a set of frequency probabilities of a method of generating them for various sets of random methods, so it might help you o use that prior to sampling from the random sets.