I have a dataset of rows of genes each with their gene length, I am looking to sample from these genes by their gene length distribution using rejection sampling - as I have too many genes in this dataset that are too small to enter further analysis (but I can't just set a cut-off on my own to remove them). I have 1 dataset of genes with gene lengths to sample from, and another proposal distribution of gene lengths that I want to use in order to do the rejection sampling of the first dataset.
An example of my data looks like this:
#df1 data to sample from:
Gene Length
Gene1 5
Gene2 6
Gene3 400000
Gene4 1000
Gene5 25000
Gene6 10
Gene7 50
Gene8 4
Gene9 100
Gene10 2000
My proposal dataset:
#df2
Gene Length
Gene1 5000
Gene2 60000
Gene3 400000
Gene4 1000
Gene5 25000
Gene6 10000
Gene7 50000
Gene8 4000
Gene9 1000
Gene10 2000
I don't have any statistics background, and I am trying to perform rejection sampling (my overall aim is to get a sample of genes with less extremely small genes in length to take onto further analysis).
To do rejection sampling I am trying this from a tutorial I found here:
X = df1$Length
U = df2$Length
accept = c()
count = 1
pi_x <- function(x) {
new_x = (3/2)*(x^3)+(11/8)*(x^2)+(1/6)*(x)+(1/12)
return(new_x)
}
while(count <= 50 & length(accept) < 50){
test_u = U[count]
test_x = pi_x(X[count])/(3.125*dunif(X[count],0,1))
if (test_u <= test_x){
accept = rbind(accept, X[count])
count = count + 1
}
count = count + 1
}
My problem with this is that it only selects 25 genes (my ideal sampling range for further analysis is 50-100 genes being selected) and most of these 25 genes are still very small in size after sampling. Is it that I need to transform X
in some way before running this rejection sampling code? My actual data for df1
is 800 genes with a skewed/beta distribution of gene length (most are very small). Or am I completely missing something else in my understanding? Any guidance at all would be appreciated.
Input data:
df1 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5",
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5L,
6L, 400000L, 1000L, 25000L, 10L, 50L, 4L, 100L, 2000L)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
df2 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5",
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5000L,
60000L, 400000L, 1000L, 25000L, 10000L, 50000L, 40000L, 1000L, 2000L)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
Edit:
I have also tried:
sampled <- data.frame(proposal = df2$Length)
sampled$targetDensity <- dbeta(sampled$proposal, 3,6)
maxDens = max(sampled$targetDensity, na.rm = T)
sampled$accepted = ifelse(df1$Length < sampled$targetDensity / maxDens, TRUE, FALSE)
hist(sampled$proposal[sampled$accepted], freq = F, col = "grey", breaks = 100)
curve(dbeta(x, 3,6),0,1, add =T, col = "red")
But I am certain I am not using dbeta()
correctly since sampled$targetDensity
outputs all zeros - is there a way I can fix this? I have tried changing the values in dbeta()
but haven't had any success.
If you know the number of genes you want to sample, the sample function should work fine:
if you want to reduce even more the probability of sampling a gene with a small length, you can square the length for the
prob
argument. The argument prob is associating a sampling probability to each element (here based on the length)If you don't know the number of genes you want to obtain, then you can define your own probability function:
note that you can also use the output of the
rejection.prob
function for thesample
function.Depending on what you expect, you might want to have your rejection function to be more continuous (without these breaks at too.small and it.is.ok). If it's the case, I would use a sigmoid function where you can tune your parameters depending on the desired output.