I have a list of events with (x,y) coordinates, where x ranges from 1 to 100 and y from 1 to 86. There are (often many) duplicates of each coordinate. I want to populate a matrix (effectively a number grid) with the counts of each coordinate. How do I do this?

Right now, my best attempt is:

s=matrix(data=NA,nrow=n,ncol=k)
for(i in 1:n){
  for(j in 1:k){
    s[i,j]=nrow(subset(data,x_column==i & y_column==j))
  }
}

This works for small (~10,000 rows) data frames, but I'd like to run it for a data frame with nearly 3 million rows, and my method is far too slow.

Edit (data):

n=86;k=100;
x_column y_column
54          30
51          32
65          34
19          46
51          27
45          60
62          31
64          45
16          69
31          33

Thanks guys!

Edit: well, it turns out the program was fast enough for my needs -- my workspace was just bogged down with tons of data, and it was slowing everything I tried to do down. So my method works, but it's good to know alternate ways of populating a matrix. I uploaded the first 10 rows; could someone do a speed test?

1

There are 1 answers

1
Josh O'Brien On BEST ANSWER

Here's one approach, using the data.table and Matrix packages:

library(data.table)
library(Matrix)

f <- function(df, nx, ny)  {
    ## Tally up the frequencies
    dt <- data.table(df, key=c("x", "y"))
    xyN <- dt[, .N, by=key(dt)]
    ## Place counts in matrix in their respective i/j x/y row/column
    as.matrix(with(xyN, sparseMatrix(i=x,j=y,x=N,dims=c(nx,ny))))
}

## Check that it works:
df <- data.frame(x=c(2,2,2,3,3,3), y=c(1,1,1,1,2,2))
f(df, nx=4, ny=4)
#      [,1] [,2] [,3] [,4]
# [1,]    0    0    0    0
# [2,]    3    0    0    0
# [3,]    1    2    0    0
# [4,]    0    0    0    0

## Speed test with 3 million coordinates
df <- data.frame(x=sample(1:100, 3e6,replace=T), y=sample(1:86, 3e6, replace=T))
system.time(res <- f(df, nx=100, ny=86))
#    user  system elapsed 
#    0.16    0.03    0.19 
sum(res)
# [1] 3e+06

If you can guarantee that you'll have at least some coordinates in each possible row and column, you can just use base R's table() (though it isn't nearly as fast):

df <- data.frame(x=sample(1:100, 3e6,replace=T), y=sample(1:86, 3e6, replace=T))
system.time(res2 <- as.matrix(table(df)))
#    user  system elapsed 
#    2.67    0.07    2.74 
sum(res2)
# [1] 3000000