I have two large data sets that look like this.
library(tidyverse)
dat1 <- tibble(chrom=c(rep(c("Chr1","Chr2"),each=5)),
start=c(9885,11944, 13271,15104,19059,25793,97514,104718,118862,120950),
end=c(11008,17644,20164,23807,25264,106001,119205, 121576,124981,138514)
)
head(dat1,n=4)
#> # A tibble: 10 × 3
#> chrom start end
#> <chr> <dbl> <dbl>
#> 1 Chr1 9885 11008
#> 2 Chr1 11944 17644
#> 3 Chr1 13271 20164
#> 4 Chr1 15104 23807
dat2 <- tibble(chrom=c(rep(c("Chr1","Chr2"),each=5)),
start=c(9885,11944, 13271,15104,19059,25793,97514,104718,118862,120950),
end=c(10203,12546,13669,15638,19283,26703,97773, 105102,119388,121331)
)
head(dat2, n=4)
#> # A tibble: 10 × 3
#> chrom start end
#> <chr> <dbl> <dbl>
#> 1 Chr1 9885 10203
#> 2 Chr1 11944 12546
#> 3 Chr1 13271 13669
#> 4 Chr1 15104 15638
Created on 2022-12-05 with reprex v2.0.2
I want to group my data based on the chrom and find which ranges [start-end], from the Chr1, from dat2 overlap with ranges [start-end] of Chr1 from dat1.
What I have tried
I have found a lovely package to work it around, but I feel I need to break the data sets into data frames of different chromosomes to compare them.
library(plyranges)
dat1 <- dat1 %>%
as_iranges()
dat2 <- dat2 %>%
as_iranges()
dat1 %>%
mutate(n_olap = count_overlaps(., dat2),
n_olap_within = count_overlaps_within(., dat2))
IRanges object with 10 ranges and 3 metadata columns:
start end width | chrom n_olap n_olap_within
<integer> <integer> <integer> | <character> <integer> <integer>
[1] 9885 11008 1124 | Chr1 1 0
[2] 11944 17644 5701 | Chr1 3 0
[3] 13271 20164 6894 | Chr1 3 0
[4] 15104 23807 8704 | Chr1 2 0
[5] 19059 25264 6206 | Chr1 1 0
[6] 25793 106001 80209 | Chr2 3 0
[7] 97514 119205 21692 | Chr2 3 0
[8] 104718 121576 16859 | Chr2 3 0
[9] 118862 124981 6120 | Chr2 2 0
[10] 120950 138514 17565 | Chr2 1 0
To get what I want from here, I need to filter my data and then compare it. But there should a way or dplyr trick to find a solution
dat1 <- dat1 %>%
as_iranges() %>%
filter(chrom=="Chr1")
dat2 <- dat2 %>%
as_iranges() %>%
filter(chrom=="Chr1")
dat1 %>%
mutate(n_olap = count_overlaps(., dat2),
n_olap_within = count_overlaps_within(., dat2))
Is there a way to compare only chromosomes with each other?
I'll stick with the
data.table
theme set forth by RicVillalba, but I think thefoverlaps
function is meant for things like this (especially with larger datasets).(Note that in addition to requiring keys, the order matters: the last two keys must be the "start" and "end" of the ranges to check for overlapping. I added
chrom
to ensure we do this by-chromosome.)That's the start. I added the
id
column intodat1
so that we could efficiently return back to the original columns. If you inspect the columns closely, notice that thei.*
columns are fromdat1
, so those are the ones we want to keep.Extending that to do the aggregation you're hoping for,
I chose to create
overlaps
and join it back ontodat1
in case there were other columns that needed to be retained. It's certainly feasible to do this more in-place indat1
, especially helpful if your dataset is too big for a temporary copy.