I would like to sample a large dataset based on the distribution of a smaller dataset in R. I have been searching for a solution for some time without success. I am relatively new to R, so I apologize if this is straightforward. I have tried a few approaches, shown below.
Here are some sample data. I'll call the two datasets observed and model:
# Set seed
set.seed(2)
# Create smaller observed data
Obs <- rnorm(1000, 5, 2.5)
# Create larger modeled data
set.seed(2)
Model <- rnorm(10000, 8, 1.5)
The distributions of the two datasets are as follows:
Goal: I would like to sample the larger "model" dataset to match the smaller "observed". I understand that there are different data points involved so it won't be a direct match.
I have been reading up on the density() and sample() functions, and I tried the following:
# Obtain the density of the observed at the length of the model.
# Note: info on the sample() function stated the prob argument in the sample() function
# must be the same length as what's being sampled. Thus, n=length(Model) below.
dens.obs <- density(Obs, n=length(Model))
# Sample the Model data the length(Obs) at the probability of density of the observed
set.seed(22)
SampleMod <- sample(Model, length(Obs), replace=FALSE, prob=dens.obs$y)
This gives me the new plot that looks very similar to the old (except for the tails):
I was hoping for a better match, so I started exploring the density function on the model data. See below:
# Density function on model, length of model
dens.mod <- density(Model, n=length(Model))
# Sample the density of the model $x at the density of the observed $ y
set.seed(22)
SampleMod3 <- sample(dens.mod$x, length(Obs), replace=FALSE, prob=dens.obs$y)
Here are two plots: the first shows the first sample (SampleMod) and the second shows the second sample (SampleMod3):
There is a more desirable shift in the right plot, which shows the x values of the model density sampled with the observed density as weights. However, these values are not the original data. That is, I did NOT sample the Modeled data. See below:
summary(SampleMod3 %in% Model)
produces:
   Mode   FALSE    NA's
logical    1000       0
This indicates that I did not sample the modeled data, but rather the density of the modeled data. Is it possible to sample a dataset based on the distribution of another dataset? Thank you in advance.
EDIT:
Thanks for all the help guys! Here is my plot using the approxfun() function suggested by danielson and supported by bethanyp.
Can anyone help me understand why the new distribution looks so funky?
Interesting question. I think this will help. First, it approximates the density function. Then, it samples from the Model points with the fitted density's probabilities.
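Here is a minimal sketch of those two steps, reusing the Obs and Model data from the question. It is one way to do it with approxfun(), not necessarily the only one. The names f.obs, f.mod, w.obs and w.ratio are just illustrative. The key difference from the attempt in the question is that the density is evaluated at each Model value, so every weight lines up with the point it belongs to (in the question, dens.obs$y is evaluated on an evenly spaced grid over the range of Obs, which has nothing to do with the order of Model).

```r
# Recreate the data from the question
set.seed(2)
Obs <- rnorm(1000, 5, 2.5)
set.seed(2)
Model <- rnorm(10000, 8, 1.5)

# Step 1: approximate the density of the observed data as a function.
# rule = 2 returns the endpoint value outside the density grid instead of NA.
dens.obs <- density(Obs)
f.obs <- approxfun(dens.obs$x, dens.obs$y, rule = 2)

# Step 2: evaluate that function at every Model value and use it as weights
w.obs <- f.obs(Model)
set.seed(22)
SampleMod <- sample(Model, length(Obs), replace = FALSE, prob = w.obs)

# Every sampled value is an actual Model value
summary(SampleMod %in% Model)

# Optional variant (my assumption, not part of the steps above):
# divide by the density of Model so that regions where Model is already
# dense are not favoured twice.
dens.mod <- density(Model)
f.mod <- approxfun(dens.mod$x, dens.mod$y, rule = 2)
w.ratio <- f.obs(Model) / f.mod(Model)
set.seed(22)
SampleMod2 <- sample(Model, length(Obs), replace = TRUE, prob = w.ratio)
```

A note on the odd shape you may see: Model values are already distributed according to their own density, so weighting them by the observed density alone gives a sample that follows roughly the product of the two densities, which sits between the two distributions. Weighting by the ratio (the optional variant) aims at the observed distribution itself, but it is limited by the data: only about 2% of the Model values fall below 5, while about half of the observed values do, so the lower tail cannot be matched without drawing the same points repeatedly (hence replace = TRUE there).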