How to avoid multiple loops with multiple variables in R

I have two datasets stored in tables: one is a set of [a, b] and the other is a set of [x, Sx, y, Sy, rho]. I have a probability function f that requires (a, b, x, Sx, y, Sy, rho). In the end I want to find the sum of the probability results over all [x, Sx, y, Sy, rho] for the first [a, b], then the sum over all [x, Sx, y, Sy, rho] for the second [a, b], and so on.

I would like to have a few hundred rows in the [x, Sx, y, Sy, rho] file and a few hundred thousand rows in the [a, b] file.

I'm wondering if there is a way to do this without using two loops? I've tried the following, and it doesn't quite work the way I want it to, but I know it will be far too slow.

I don't know if it will help but I've added the function in the code. Sorry that the function itself is a mess and not formatted properly.

# data  file with (a, b)
data            <- matrix( c(1, 0, 1, 1, 0.5, 0), nrow=3, ncol=2) 
colnames(data)  <- c("a", "b") 
Ndat            <- dim(data)
Ndata           <- Ndat[1]

# data2 file with (x, Sx, y, Sy, rho)
data2           <- matrix( c(1, 0.1, 1, 0.1, 0.002,
                             2, 0.1, 2, 0.1, 0.000001,
                             2, 0.1, 1, 0.1, 0.002), nrow=3, ncol=5, byrow=TRUE)
colnames(data2) <- c("x", "Sx", "y", "Sy", "rho") 
Ndat2           <- dim(data2)
Ndata2          <- Ndat2[1]

# function requires variables (a, b, x, Sx, y, Sy, rho)
Prob  <- function(a, b, Xi, sX, Yi, sY, rho) {sqrt(1 + a ^ 2) * (
  exp(-((b + a * Xi - Yi) ^ 2 / (
    2 * ((a ^ 2 * sX ^ 2) - 
         (2 * a * rho * sX * sY) + sY ^ 2)))) * sqrt((
           1 - rho ^ 2) / (
             a ^ 2 * sX ^ 2 - 2 * a * rho *sX *sY + sY ^ 2))/(
               sqrt(2 * pi) * sqrt(1 - rho ^ 2)))
    }

# Here is my weak attempt
Table <- NULL
Table <- for (j in 1:Ndata) { 
   sum (for (i in 1:Ndata2) {
   Datatable[i] = Prob(data[j, a], data[j, b], data2[i, x], 
                 data2[i, Sx], data2[i, y], data2[i, Sy], 
                 data2[i, rho])
   })
}

I am having a very hard time wrapping my head around the apply functions and when they can/should be used. I know that I've probably not added enough information, so any suggestions that can help me out would be great. I'm pretty new to programming as well as R, so please forgive any inappropriate vocabulary or formatting.

There is probably a better way to get the number of rows in data and store it in Ndata as a global, but this is the first one I stumbled across.

The function should not be recursive, but I see now that it is as I've written it. I have spent many hours on intro tutorials to R and still am having a very hard time understanding how the apply suite of functions are best implemented.

I would like one iteration to apply this function to each row in data2 using a, b from the first row of data, then sum the probabilities over all of those rows. The next iteration should do the same with a, b from row 2 of data, again applied to every row of data2, and so on.

There is 1 answer

Answered by jlhoward

I have a feeling there's an easier way to do this, but something like this will probably work.

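# Stand-in for the real function; any vectorized f works the same way
f <- function(a, b, x, y, z) a + b + x + y + z
# Wrapper that takes the two parameter sets as data frames (or lists)
f.new <- function(p1, p2) {
  p1 <- as.list(p1); p2 <- as.list(p2)
  f(p1$a, p1$b, p2$x, p2$y, p2$z)
}

data1 <- data.frame(a = 1:10, b = 11:20)
data2 <- data.frame(x = 1:5, y = 21:25, z = 31:35)
# Every (data2 row, data1 row) pair; indx2 varies fastest
indx   <- expand.grid(indx2 = seq(nrow(data2)), indx1 = seq(nrow(data1)))
result <- with(indx, f.new(data1[indx1, ], data2[indx2, ]))
# Sum over the data2 rows for each row of data1
sums   <- aggregate(result, by = list(rep(seq(nrow(data1)), each = nrow(data2))), sum)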

You seem to want to evaluate a function for every combination of two sets of variables, the set of (a,b) and the set of (x, Sx, y, Sy, rho), then sum over the second set, for every instance of the first set.

So first this redefines the function f(...) to take two arguments, representing the two sets. This is f.new(...). You should probably define your original function that way - it will run faster.

Then we create a data frame, indx, with two columns that together list every combination of the row numbers in data1 and data2. We call f.new(...) on data1 and data2 indexed by those columns. This produces result, which holds the function evaluated at every combination of (a,b) and (x,y,z). Finally, we aggregate result by the data1 row number to get the sums you specified.
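As a rough sketch of how the same indexing idea could be applied to the Prob function from the question: the version below factors the repeated variance term into a local variable v (my restatement, meant to be algebraically identical to the original), assumes the data2 values were intended to be read row by row, and stores both tables as data frames. The names data1, p and sums are only illustrative.

```r
# Prob from the question, with the repeated variance term pulled out as v.
# It uses only arithmetic, exp() and sqrt(), so it is vectorized over all arguments.
Prob <- function(a, b, Xi, sX, Yi, sY, rho) {
  v <- a^2 * sX^2 - 2 * a * rho * sX * sY + sY^2
  sqrt(1 + a^2) * exp(-(b + a * Xi - Yi)^2 / (2 * v)) *
    sqrt((1 - rho^2) / v) / (sqrt(2 * pi) * sqrt(1 - rho^2))
}

# Example data from the question (assumption: data2 values are listed row by row)
data1 <- data.frame(a = c(1, 0, 1), b = c(1, 0.5, 0))
data2 <- data.frame(x   = c(1, 2, 2),
                    Sx  = c(0.1, 0.1, 0.1),
                    y   = c(1, 2, 1),
                    Sy  = c(0.1, 0.1, 0.1),
                    rho = c(0.002, 0.000001, 0.002))

# Every (data2 row, data1 row) pair of row numbers
indx <- expand.grid(indx2 = seq_len(nrow(data2)),
                    indx1 = seq_len(nrow(data1)))

# One vectorized call evaluates Prob for all pairs at once
p <- with(indx, Prob(data1$a[indx1], data1$b[indx1],
                     data2$x[indx2], data2$Sx[indx2],
                     data2$y[indx2], data2$Sy[indx2], data2$rho[indx2]))

# Sum over the data2 rows separately for each data1 row
sums <- tapply(p, indx$indx1, sum)
```

Here sums has one entry per row of data1, in row order. Indexing the columns directly (data1$a[indx1]) avoids building two large intermediate data frames, though p still has nrow(data1) * nrow(data2) elements.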

This approach is memory intensive: at the sizes you describe, result will have roughly 10 million elements (one per combination of rows). It should still run much faster than nested loops.
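If holding every combination in memory is a problem, one possible compromise is to loop (via mapply) over the rows of the (a, b) table only and let the vectorized function handle all rows of the second table in a single call. The sketch below assumes the Prob function and example data from the question, with data2 read row by row; sum_over_data2 is a hypothetical helper name, not something from the original code.

```r
# Prob from the question, with the repeated variance term pulled out as v
Prob <- function(a, b, Xi, sX, Yi, sY, rho) {
  v <- a^2 * sX^2 - 2 * a * rho * sX * sY + sY^2
  sqrt(1 + a^2) * exp(-(b + a * Xi - Yi)^2 / (2 * v)) *
    sqrt((1 - rho^2) / v) / (sqrt(2 * pi) * sqrt(1 - rho^2))
}

# Example data from the question (assumption: data2 values are listed row by row)
data1 <- data.frame(a = c(1, 0, 1), b = c(1, 0.5, 0))
data2 <- data.frame(x   = c(1, 2, 2),
                    Sx  = c(0.1, 0.1, 0.1),
                    y   = c(1, 2, 1),
                    Sy  = c(0.1, 0.1, 0.1),
                    rho = c(0.002, 0.000001, 0.002))

# Hypothetical helper: for a single (a, b), evaluate Prob on every row of d2
# in one vectorized call and return the total
sum_over_data2 <- function(a, b, d2) {
  sum(Prob(a, b, d2$x, d2$Sx, d2$y, d2$Sy, d2$rho))
}

# One call per row of data1; only nrow(data2) values exist at any one time
sums <- mapply(sum_over_data2, data1$a, data1$b,
               MoreArgs = list(d2 = data2))
```

This keeps a single implicit loop over the large table, so it trades some speed for a much smaller memory footprint than the full grid of combinations.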