Downsampling dataset


I have a dataset, which is a large character vector (1,024,459 elements), consisting of gene IDs. It looks like:

> length(allres)
[1] 1024459
> allres[1:10]
[1] "1"   "1"   "1"   "1"   "1"   "1"   "1"   "10"  "10"  "100"  

where each gene ID is repeated once for every time it was seen in an RNA-seq run (so here, there were 7 reads for gene "1" and 2 for gene "10"). I want to plot the number of genes identified against the number of reads, at 10,000-read intervals, so that I can see how many genes are identified if I randomly sample 10,000 reads, 20,000, 30,000, etc. I made a spacing vector with the seq() function like so:

> gaps <- seq(10000, length(allres), by=10000)  

but I'm unsure how to apply that to my allres vector and plot it. Any help is quite appreciated.

Oliver Keyes (best answer)

So, what you probably want is something like this:

gaps <- seq(10000, length(allres), by = 10000)

lapply(gaps, function(x){

    # x is the sample size itself (an element of gaps), not an index into gaps.
    # This gives the number of appearances of each gene ID within
    # an x-sized random sample of allres (drawn without replacement).
    aggregated_sample <- table(sample(allres, size = x))

    # Plotting code for the sample goes here. Since x is the number of reads,
    # you can even use it in the title.
    # Remember to include code to save the plot to disk if you want to keep it.
    return(TRUE)

})

If you're using ggplot2 for plotting, of course, you can even save the plot as an object and then return(plot) instead of return(TRUE) and do further tweakery/investigation afterwards.
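If what you're after is a single curve of genes identified versus reads sampled, rather than one plot per sample size, a minimal sketch along the following lines might do it. It assumes "genes identified" means the number of distinct IDs in each random sample. The toy allres and the name genes_detected are made up for illustration; swap in your real vector.

```r
# Toy stand-in for the real data (hypothetical read counts, illustration only):
# 500 gene IDs, each repeated 5, 50, or 200 times.
set.seed(42)
allres <- rep(as.character(1:500), times = rep(c(5, 50, 200), length.out = 500))

# Sampling depths: every 10,000 reads up to the size of the dataset
gaps <- seq(10000, length(allres), by = 10000)

# For each depth, draw that many reads without replacement
# and count how many distinct gene IDs show up
genes_detected <- vapply(
  gaps,
  function(n) length(unique(sample(allres, size = n))),
  integer(1)
)

# One point per sampling depth
plot(gaps, genes_detected, type = "b",
     xlab = "Reads sampled", ylab = "Genes identified")
```

Each depth is sampled independently here, so the curve can wobble a little from run to run; repeating the sampling a few times per depth and averaging would smooth that out.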