R comparing unequal vectors with inequality

I have two single-column data frames of unequal length:

aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))

For each observation in aa, I want to count the number of values in bb that are less than it.

My result:

   bb<aa 
1   1
2   7
3   9

I have been able to do this in two ways, by writing a function and calling it with apply, but my datasets are large: I let one version run all night and it did not finish.

What I have:

fun1 <- function(a, b) {
  k <- colSums(b < a)  # count of b values less than a
  k * .000058242       # scale the count by a constant factor
}

system.time(replicate(5000,data.frame(apply(aa,1,fun1,b=bb))))
       user  system elapsed 
      3.813   0.011   3.883 

Secondly,

fun2 <- function(a, b) {
  k <- length(which(b < a))  # count of b values less than a
  k * .000058242             # scale the count by a constant factor
}

system.time(replicate(5000,data.frame(apply(aa,1,fun2,b=bb))))
   user  system elapsed 
  3.648   0.006   3.664 

The second function is slightly faster in all my tests, but I let the first run all night on a dataset where bb has more than 1.7 million rows and aa has more than 160 thousand.

I found this post and tried using with(), but could not get it to work; I also tried a for loop without success.

Any help or direction is appreciated.

Thank you!

There are 2 answers

jlhoward On BEST ANSWER
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
sapply(aa[[1]],function(x)sum(bb[[1]]<x))
# [1] 1 7 9

Some more realistic examples:

n  <- 1.6e3
bb <- sample(1:n, 1.7e6, replace = TRUE)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
#    user  system elapsed 
#   14.63    2.23   16.87 

n  <- 1.6e4
bb <- sample(1:n, 1.7e6, replace = TRUE)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
#    user  system elapsed 
#  148.77   18.11  167.26 

So with length(aa) = 1.6e4 this takes just under 3 min (on my system), and the run time scales linearly with length(aa), as expected: ten times as many values took about ten times as long. With your full dataset (aa > 160k), it should therefore take roughly 25 to 30 min. That is still rather slow; maybe someone else will come up with a better way.
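
One possible faster approach, offered only as a sketch and not timed here: sort bb once, then locate each aa value with base R's findInterval(). With left.open = TRUE (available since R 3.3.0), it returns the number of sorted values strictly less than each query, which matches sum(bb < x). The name count_less is a hypothetical helper, not something from the posts above.

```r
# Hypothetical helper: for each element of a, count the elements of b
# that are strictly less than it.
count_less <- function(a, b) {
  # Sort b once; each lookup is then a search in the sorted vector
  # instead of a full pass over b.
  findInterval(a, sort(b), left.open = TRUE)
}

aa <- c(2, 12, 35)
bb <- c(1, 2, 3, 4, 5, 6, 7, 15, 22, 36)
count_less(aa, bb)
# [1] 1 7 9
```

The work is dominated by the single sort of bb, so this should scale far better than scanning all of bb for every element of aa, though actual timings will depend on the data and the machine.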

Jeff Tilton On

In my original post I had been looking for the number of times bb < aa.

So in my example

aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))

 x<-ecdf(bb[,1])

 x(2)

[1] 0.2

 x(12)

[1] 0.7

 x(35)

[1] 0.9

To get the answers in my original post, I would need to multiply by the number of data points in bb, in this instance 10. The first one is not the same, though, because in my original post I had stated bb < aa, whereas ecdf() counts values less than or equal to its argument.
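
The gap between the ecdf() result and the original counts can be seen in a small sketch on the example data. ecdf() returns the proportion of values less than or equal to its argument, so scaling by length(bb) gives a "<=" count. The variable names cdf_bb, count_le and count_lt are illustrative only.

```r
aa <- c(2, 12, 35)
bb <- c(1, 2, 3, 4, 5, 6, 7, 15, 22, 36)

cdf_bb <- ecdf(bb)

# Count of bb values <= each aa value: the ecdf is a proportion,
# so scale by the number of points (round() guards against
# floating-point noise).
count_le <- round(cdf_bb(aa) * length(bb))

# Count of bb values strictly < each aa value, as in the original question.
count_lt <- sapply(aa, function(x) sum(bb < x))

count_le
# [1] 2 7 9
count_lt
# [1] 1 7 9
```

The two only differ where an aa value also occurs in bb (here, the value 2), which rarely matters for continuous elevation data.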

I am dealing with large datasets of land elevation and water elevation, over 1 million data points each, but in the end I am creating an inundation curve: I want to know how much land will be inundated at a water level with a given exceedance probability.

So using the above ecdf() function on all 1 million data points would still be time consuming, but I realized I do not need all the data points, just enough to create my curve.

So I applied the ecdf() function to the entire land dataset, then created a sequence of water elevations fine enough to define the curve I needed, but short enough to be computed rapidly.

land_elevation <- data.frame(rnorm(1e6))
water_elevation<- data.frame(rnorm(1e6))

cdf_land<- ecdf(land_elevation[,1])

elevation_seq <- seq(from = min(water_elevation[,1]), to = max(water_elevation[,1]), length.out = 1000)

land <- sapply(elevation_seq, cdf_land)

My results are the same, but they are computed much faster.
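
As a possible simplification, the function returned by ecdf() is vectorized, so it can probably be called on the whole sequence directly instead of through sapply(). The sketch below checks this on smaller simulated data; the seed, the sample sizes, and the names land_sapply and land_vec are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
land_elevation  <- rnorm(1e4)
water_elevation <- rnorm(1e4)

cdf_land <- ecdf(land_elevation)
elevation_seq <- seq(from = min(water_elevation),
                     to = max(water_elevation),
                     length.out = 1000)

# Element-by-element evaluation, as in the code above.
land_sapply <- sapply(elevation_seq, cdf_land)

# Direct vectorized evaluation of the same ecdf.
land_vec <- cdf_land(elevation_seq)

# Compare the two results.
all.equal(land_sapply, land_vec)
```

If the comparison holds on the real data as well, the vectorized call avoids one function call per element of the sequence.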