I have a massive data frame in R that looks something like this:
Dataframe 1
| ID | start | stop |
|---|---|---|
| + | 3 | 5 |
| - | 9 | 13 |
I also have another dataframe that looks something like this:
Dataframe 2
| ID | name | position |
|---|---|---|
| + | hello | 4 |
| + | there | 7 |
| - | my | 11 |
| + | name | 12 |
What I want to do is for every row in dataframe 1, I want to determine it's "distance" within a given range or print NA if it doesn't fall into a range or the "ID" (+/-) doesn't match. So the output would be something like:
| ID | name | position | relative_position |
|---|---|---|---|
| - | hello | 4 | 2 |
| + | there | 7 | NA |
| - | my | 12 | 3 |
| + | name | 12 | NA |
I currently have a relatively long loop that uses apply() and a few layers of if functions to run through this process. The problem I'm having is that I need to do this for a multiple REALLY big dataframes (gigabytes in size), so efficiency is key. What's the most efficient way of going through this process?
I've also played around with the cut() function but I can't figure out how to print the results I want or how to make intervals discontinuous. Even if it works I don't know if it would be more efficient.
Update
I have now the full code. It works even if there is overlap of intervals. In this case, the algorithm finds the closest interval by default. If the option
closestOnly = FALSE, then it looks farther away ifiddoes not match.Wimpel is really fast but it does not necessary find the closest distance. This is illustrated by this example:
If overlapping intervals is fine as for Wimpel method, go for it as performance is a lot better. But if you need better interval overlapping solution, this algorithm is for you.
For speed comparison,
A reimplementation of my algorithm in C (what data.table does under the hood), should reach similar performance (or better).
Old post
It is mainly a problem of finding if many positions are in intervals. This is an algorithmic issue.
Here is an efficient algorithm to find for all position the containing intervals. I leave in exercise the computation of distance and the proper formatting of the result ;-) (it should be easy from this canvas).
EDIT: Careful, the algorithm only works properly if start and end are well imbricated. (1, 5) and (3, 7) are not properly imbricated (3 is in (1, 5) and 7 outside). The pop on 5 from the stack will remove the interval (3, 7). If the interval are not properly imbricated, the stack should become a set and the pop on an interval end should remove the matching start.