How to sample data based off the distribution of another dataset in R

Question

How to sample data based off the distribution of another dataset in R

5.9k views Asked by Phil At 05 January 2017 at 21:43

I would like to sample a large dataset based on the distribution of a smaller dataset in R. I have been searching for a solution for some time without success. I am relatively new in R so I apologize if this is straightforward. However, I have tried some solutions.

Here are some sample data. I'll call it observed and model:

# Set seed
set.seed(2)

# Create smaller observed data
Obs <- rnorm(1000, 5, 2.5)

# Create larger modeled data
set.seed(2)
Model <- rnorm(10000, 8, 1.5)

The distributions of the two datasets are as follows:

Goal: I would like to sample the larger "model" dataset to match the smaller "observed". I understand that there are different data points involved so it won't be a direct match.

I have been reading up on the density() and sample() where I do the following:

# Obtain the density of the observed at the length of the model.
# Note: info on the sample() function stated the prob argument in the sample() function 
# must be the same length as what's being sampled. Thus, n=length(Model) below.

dens.obs <- density(Obs, n=length(Model))

# Sample the Model data the length(Obs) at the probability of density of the observed
set.seed(22)
SampleMod <- sample(Model, length(Obs), replace=FALSE, prob=dens.obs$y)

This gives me the new plot that looks very similar to the old (except for the tails):

I was hoping for a better match. Therefore I started explored using the density function on the model data. See below:

# Density function on model, length of model
dens.mod <- density(Model, n=length(Model))

# Sample the density of the model $x at the density of the observed $ y
set.seed(22)
SampleMod3 <- sample(dens.mod$x, length(Obs), replace=FALSE, prob=dens.obs$y)

Here are two plots, the first is the same as the first sampled and the second is the second sampled:

There is a more desirable shift in the right plot, which represents the sampled density of the modeled by the density of the observed. However, the data are not the same. That is, I did NOT sample the Modeled data. See below:

summary(SampleMod3 %in% Model)

produces:

   Mode   FALSE    NA's 
logical    1000       0

Indicating that I did not sample the modeled data, but rather the density of the modeled data. Is it possible to sample a dataset based on the distribution of another dataset? Thank you in advance.

EDIT:

Thanks for all the help guys! Here is my plot using approxfun() function offered from danielson and supported by bethanyp.

Any help with understanding why the funky new distribution?

Original Q&A

There are 2 answers

**danielson** · Answer 1 · 2017-01-05T22:45:18+00:00

Interesting question. I think this will help. First, it approximates the density function. Then, it samples from the Model points with the fitted density's probabilities.

predict_density = approxfun(dens.obs) #function that approximates dens.obs
#sample points from Model with probability distr. of dens.obs
SampleMod3 <- sample(Model, length(Obs), replace=FALSE, prob=predict_density(Model))
summary(SampleMod3 %in% Model)
   Mode    TRUE    NA's 
logical    1000       0

**sconfluentus** · Answer 2 · 2017-01-05T23:02:15+00:00

I assume that in practice you are using a real set of non-randomly generated data. In which case the likely values of the different samples have a probability of coming up because random sampling method does not mean no pattern in the data. In the wilderness real things have real frequencies, which will show in your meta-sample.

So you should use the weighted probabilities in selecting your smaller sub-sample from the original.

Example the whole population {1,2,1,3,4,1,3} where probabilities for each number being drawn (remember the sum must equal 1): 1 : .4285 2 :.1429 3: .2857 4: .1429

if you use these weighted probabilities in the prob= my_freqs part of

sample(x, size, replace = FALSE, prob = my_freqs)

You will likely obtain a probability more inline with what you were expecting. But I am not 100% sure if this is what you are after.

In the random data, try set.seed(2) and see if telling R to use the seed used to generate those frequencies in the original set creation gets you closer to your goal.

I know that there is a universal random formula associated with each set. I would have to assume it is a set of frequency probabilities of a method of generating them for various sets of random methods, so it might help you o use that prior to sampling from the random sets.

TechQA.

How to sample data based off the distribution of another dataset in R

There are 2 answers

Related Questions in R

Related Questions in SAMPLING

Related Questions in KERNEL-DENSITY

Popular Questions

Popular Tags

Trending Questions