Given an n x 2 data matrix X, I'd like to calculate the bivariate empirical cdf at each observation, i.e. for each i in 1:n, return the proportion of observations whose 1st element is not greater than X[i,1] and whose 2nd element is not greater than X[i,2].
Because of the nested search involved it gets terribly slow for n ~ 100k, even after porting it to Fortran. Does anyone know if there's a better way of handling sample sizes like this?
Edit: I believe this problem is similar (in terms of complexity) to computing Kendall's tau, which is O(n^2) when done naively. For that problem Knight (1966) gives an algorithm that reduces the cost to O(n log(n)). I'm wondering whether an O(n log(n)) algorithm for the bivariate ecdf already exists.
Edit 2: This is the code I have in Fortran, as requested. It is called from R in the usual way, so the R code is omitted here. The code is written for arbitrary dimensions, but for what I'm doing the bivariate case is good enough.
! Calculates the multivariate empirical cdf at each point
! n: number of observations
! d: dimension (>=2)
! umat: data matrix (n rows, d columns)
! outvec: vector of ecdf values, one per observation
subroutine mecdf(n,d,umat,outvec)
  implicit none
  integer :: n, d, i, j, k, tempsum
  double precision, dimension(n) :: outvec
  double precision, dimension(n,d) :: umat
  logical :: flag
  do i = 1,n
    tempsum = 0
    do j = 1,n
      ! flag stays true only if point j is <= point i in every coordinate
      flag = .true.
      do k = 1,d
        if (umat(i,k) < umat(j,k)) then
          flag = .false.
          exit
        end if
      end do
      if (flag) then
        tempsum = tempsum + 1
      end if
    end do
    ! convert in double precision; real() would round to single precision
    outvec(i) = dble(tempsum)/dble(n)
  end do
  return
end subroutine
I think my first effort was not really an ecdf, although it did map the points to the interval [0,1]. The example is a 25 x 2 matrix generated with:
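So the task is to do, for each point, a single summation of a two-part logical test on N items, which I suspect costs on the order of 3N operations per point (O(N*3)), and so grows quadratically once it is repeated for all N points. It might be marginally faster if implemented in Rcpp, but these are already vectorized operations.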
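For larger N, one way to avoid the pairwise comparisons is to sort the points by their first coordinate and keep running counts of the second coordinate in a Fenwick (binary indexed) tree, which gives O(N log N) overall. The following is only a minimal sketch of that idea, written in Python rather than Fortran or R for brevity; the function names `bivariate_ecdf` and `bivariate_ecdf_bruteforce` are made up for illustration, and the input is assumed to be a list of (x, y) pairs with no missing values. Ties are handled so that the result matches the "not greater than" definition in the question, and each point counts itself, as in the Fortran loop.

```python
def bivariate_ecdf(points):
    """For each (x, y), return the fraction of points (a, b) with a <= x and b <= y."""
    n = len(points)
    # Map each distinct y value to a 1-based rank; tied y values share a rank.
    distinct_y = sorted(set(y for _, y in points))
    y_rank = {y: r for r, y in enumerate(distinct_y, start=1)}
    m = len(distinct_y)
    tree = [0] * (m + 1)  # Fenwick tree holding counts per y rank

    def add(pos):
        # Record one more point with y rank `pos`.
        while pos <= m:
            tree[pos] += 1
            pos += pos & -pos

    def prefix_count(pos):
        # Number of recorded points with y rank <= pos.
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    order = sorted(range(n), key=lambda i: points[i][0])
    out = [0.0] * n
    start = 0
    while start < n:
        # Find the block of points sharing the same x value.
        x = points[order[start]][0]
        stop = start
        while stop < n and points[order[stop]][0] == x:
            stop += 1
        # Insert the whole block first so that ties in x count each other.
        for i in order[start:stop]:
            add(y_rank[points[i][1]])
        for i in order[start:stop]:
            out[i] = prefix_count(y_rank[points[i][1]]) / n
        start = stop
    return out


def bivariate_ecdf_bruteforce(points):
    """O(N^2) reference: the same nested comparison as the Fortran loop with d = 2."""
    n = len(points)
    return [sum(1 for a, b in points if a <= x and b <= y) / n
            for x, y in points]
```

The brute-force version is included only as a correctness check on small inputs. The same structure (sort by the first coordinate, rank the second, then a prefix-count tree) should port to Fortran or Rcpp, though this sketch has not been benchmarked at N ~ 100k.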