How to implement rejection sampling in R?

431 views Asked by At

I have a dataset of rows of genes each with their gene length, I am looking to sample from these genes by their gene length distribution using rejection sampling - as I have too many genes in this dataset that are too small to enter further analysis (but I can't just set a cut-off on my own to remove them). I have 1 dataset of genes with gene lengths to sample from, and another proposal distribution of gene lengths that I want to use in order to do the rejection sampling of the first dataset.

An example of my data looks like this:

#df1 data to sample from:
Gene  Length
Gene1  5
Gene2  6
Gene3  400000
Gene4  1000
Gene5  25000
Gene6  10
Gene7  50
Gene8  4
Gene9  100
Gene10 2000

My proposal dataset:

#df2
Gene  Length
Gene1  5000
Gene2  60000
Gene3  400000
Gene4  1000
Gene5  25000
Gene6  10000
Gene7  50000
Gene8  4000
Gene9  1000
Gene10 2000

I don't have any statistics background, and I am trying to perform rejection sampling (my overall aim is to get a sample of genes with less extremely small genes in length to take onto further analysis).

To do rejection sampling I am trying this from a tutorial I found here:

X = df1$Length
U = df2$Length

accept = c()
count = 1

pi_x <- function(x) {
  new_x = (3/2)*(x^3)+(11/8)*(x^2)+(1/6)*(x)+(1/12)
  return(new_x)
}


while(count <= 50 & length(accept) < 50){
  test_u = U[count]
  test_x = pi_x(X[count])/(3.125*dunif(X[count],0,1))
  if (test_u <= test_x){
    accept = rbind(accept, X[count])
    count = count + 1
  }
  count = count + 1
}

My problem with this is that it only selects 25 genes (my ideal sampling range for further analysis is 50-100 genes being selected) and most of these 25 genes are still very small in size after sampling. Is it that I need to transform X in some way before running this rejection sampling code? My actual data for df1 is 800 genes with a skewed/beta distribution of gene length (most are very small). Or am I completely missing something else in my understanding? Any guidance at all would be appreciated.

Input data:

df1 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5L, 
6L, 400000L, 1000L, 25000L, 10L, 50L, 4L, 100L, 2000L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

df2 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5000L, 
60000L, 400000L, 1000L, 25000L, 10000L, 50000L, 40000L, 1000L, 2000L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

Edit:

I have also tried:

sampled <- data.frame(proposal = df2$Length)
sampled$targetDensity <- dbeta(sampled$proposal, 3,6)

maxDens = max(sampled$targetDensity, na.rm = T)
sampled$accepted = ifelse(df1$Length < sampled$targetDensity / maxDens, TRUE, FALSE)

hist(sampled$proposal[sampled$accepted], freq = F, col = "grey", breaks = 100)
curve(dbeta(x, 3,6),0,1, add =T, col = "red")

But I am certain I am not using dbeta() correctly since sampled$targetDensity outputs all zeros - is there a way I can fix this? I have tried changing the values in dbeta() but haven't had any success.

1

There are 1 answers

4
glagla On

If you know the number of genes you want to sample, the sample function should work fine:

sampled = sample(df$genes, size = n, prob = df$length) 

if you want to reduce even more the probability of sampling a gene with a small length, you can square the length for the prob argument. The argument prob is associating a sampling probability to each element (here based on the length)

If you don't know the number of genes you want to obtain, then you can define your own probability function:

rejection.prob = function(x){
  if (x<too.small) {return(0)} # all genes smaller than too.small won't be sampled
  if (x > it.is.ok) {return(1)} # all these ones will be sampled
  if (x>too.small & (x<it.is.ok){
    # here make any function that is equal to 0 when x == too.small
    # and 1 when x == it.is.ok
    # it can be a simple linear function
}

note that you can also use the output of the rejection.prob function for the sample function.

Depending on what you expect, you might want to have your rejection function to be more continuous (without these breaks at too.small and it.is.ok). If it's the case, I would use a sigmoid function where you can tune your parameters depending on the desired output.