I have two datasets stored in tables: one is a set of [a, b] and the other is [x, Sx, y, Sy, rho]. I have a probability function f that requires (a, b, x, Sx, y, Sy, rho). In the end I want to find the sum of the probability results over all [x, Sx, y, Sy, rho] for the first [a, b], then the sum over all [x, Sx, y, Sy, rho] for the second [a, b], and so on.
I would like to have a few hundred rows in the [x, Sx, y, Sy, rho] file and a few hundred thousand rows in the [a, b] file.
I'm wondering if there is a way to do this without using two loops. I've tried the following; it doesn't quite work the way I want it to, and I know it will be far too slow.
I don't know if it will help but I've added the function in the code. Sorry that the function itself is a mess and not formatted properly.
# data file with (a, b)
data <- matrix( c(1, 0, 1, 1, 0.5, 0), nrow=3, ncol=2)
colnames(data) <- c("a", "b")
Ndat <- dim(data)
Ndata <- Ndat[1]
# data2 file with (x, Sx, y, Sy, rho)
# byrow=TRUE so that each group of five values becomes one row
data2 <- matrix( c(1, 0.1, 1, 0.1, 0.002, 2, 0.1, 2, 0.1, 0.000001,
2, 0.1, 1, 0.1, 0.002), nrow=3, ncol=5, byrow=TRUE)
colnames(data2) <- c("x", "Sx", "y", "Sy", "rho")
Ndat2 <- dim(data2)
Ndata2 <- Ndat2[1]
# function requires variables (a, b, x, Sx, y, Sy, rho)
Prob <- function(a, b, Xi, sX, Yi, sY, rho) {
  sqrt(1 + a^2) * (
    exp(-((b + a * Xi - Yi)^2 /
          (2 * ((a^2 * sX^2) - (2 * a * rho * sX * sY) + sY^2)))) *
      sqrt((1 - rho^2) / (a^2 * sX^2 - 2 * a * rho * sX * sY + sY^2)) /
      (sqrt(2 * pi) * sqrt(1 - rho^2))
  )
}
# Here is my weak attempt
Table <- NULL
Table <- for (j in 1:Ndata) {
sum (for (i in 1:Ndata2) {
Datatable[i] = Prob(data[j, a], data[j, b], data2[i, x],
data2[i, Sx], data2[i, y], data2[i, Sy],
data2[i, rho])
})
}
I am having a very hard time wrapping my head around the apply functions and when they can/should be used. I know that I've probably not added enough information, so any suggestions that can help me out would be great. I'm pretty new to programming as well as R, so please forgive any inappropriate vocabulary or formatting.
There is probably a better way to define the number of rows in data to get Ndata as a global, but these are the first I stumbled across.
The function should not be recursive, but I see now that it is as I've written it. I have spent many hours on intro tutorials to R and am still having a very hard time understanding how the apply suite of functions is best implemented.
I would like one iteration to apply this function to each row in data2 using a, b from the first row of data, then sum the probability for all of those. The next iteration should then sum all of the probabilities for row 2 of data, using a, b applied to every row of data2.
I have a feeling there's an easier way to do this, but something like this will probably work.
You seem to want to evaluate a function for every combination of two sets of variables, the set of (a, b) and the set of (x, Sx, y, Sy, rho), then sum over the second set for every instance of the first set.
So first, this redefines the function f(...) to take two arguments, representing the two sets. This is f.new(...). You should probably define your original function that way; it will run faster.
Then we create a data frame, indx, that has two columns representing every combination of the row numbers in data1 and data2. We then call f.new(...) using data1 and data2 indexed using indx. This produces result, which has the function evaluated at every combination of (a, b) and (x, Sx, y, Sy, rho). Finally, we aggregate that to get the sums you specified.
This approach is memory intensive: result will have ~10MM elements, but it will run faster than loops.
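The steps described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the body of f.new, the toy values in data1 and data2, and the name sums are all assumptions made for the example. f.new uses an algebraically simplified form of the question's Prob (the sqrt(1 - rho^2) factors cancel), which assumes |rho| < 1.

```r
# Toy inputs mirroring the question; the values are illustrative only.
data1 <- data.frame(a = c(1, 0, 1), b = c(1, 0.5, 0))
data2 <- data.frame(
  x   = c(1, 2, 2),
  Sx  = c(0.1, 0.1, 0.1),
  y   = c(1, 2, 1),
  Sy  = c(0.1, 0.1, 0.1),
  rho = c(0.002, 0.000001, 0.002)
)

# Hypothetical two-argument form of the probability function:
# p has columns (a, b), q has columns (x, Sx, y, Sy, rho),
# and the rows of p and q are paired element-wise.
# Assumes |rho| < 1 so the sqrt(1 - rho^2) terms of Prob cancel.
f.new <- function(p, q) {
  v <- p$a^2 * q$Sx^2 - 2 * p$a * q$rho * q$Sx * q$Sy + q$Sy^2
  sqrt(1 + p$a^2) * exp(-(p$b + p$a * q$x - q$y)^2 / (2 * v)) /
    sqrt(2 * pi * v)
}

# Every combination of a row number in data1 (i) and in data2 (j).
indx <- expand.grid(i = seq_len(nrow(data1)), j = seq_len(nrow(data2)))

# One vectorized call over all (i, j) pairs.
result <- f.new(data1[indx$i, ], data2[indx$j, ])

# Sum over the rows of data2 for each row of data1.
sums <- as.vector(tapply(result, indx$i, sum))
```

Here sums has one entry per row of data1, in row order. Because indx and result each hold nrow(data1) * nrow(data2) entries, it is worth checking the memory footprint on a subset of the (a, b) rows before running the full data.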