I have a fairly large dataset of georeferenced data, about 49 million records. Using the data.table package I've been able to subset it (it was originally 100 million records) and do some simple calculations, such as the mean center of the geographic coordinates (in degrees) for each user.
There are 214,600 unique users. For each user, I need to calculate the distance (using the great-circle formula; see my sample code) from the geographic coordinates of every one of that user's records to the mean center of that user's coordinates. That means I need columns V6 and V7 (longitude and latitude, respectively) for the great-circle distance calculation. V4 is the userID, V3 is the userImageID, and V8 (the column that contains 16) is the accuracy of the coordinates. V5 is the time field, which I have already sorted in ascending order.
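For reference, the haversine form of the great-circle distance described above can be written as a single vectorized function. This is only a minimal sketch: `haversine_km` is a hypothetical helper name (it does not come from fossil or geosphere), coordinates are assumed to be in decimal degrees, and the radius is the value used in my code further down.

```r
# Haversine great-circle distance in kilometres, vectorized over its inputs.
# `haversine_km` is a hypothetical helper, not a function from any package.
# All coordinates are assumed to be in decimal degrees.
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371.0087714) {
  rad  <- pi / 180
  phi1 <- lat1 * rad                 # latitudes in radians
  phi2 <- lat2 * rad
  dphi <- phi2 - phi1                # latitude difference
  dlam <- (lon2 - lon1) * rad        # longitude difference
  h <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlam / 2)^2
  h <- pmin(1, h)                    # guard against rounding slightly above 1
  2 * R * atan2(sqrt(h), sqrt(1 - h))
}
```

Because every operation is elementwise, the function can take whole columns for the first point and a single mean center for the second, so no explicit loop over records should be needed.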
My problem is that I cannot make the code iterate only over the coordinates that belong to each user, and I end up running out of memory because the first record is run against the other 48 million records, and so on.
I have tried my own formula for the great-circle distance as well as functions from the fossil and geosphere packages, with no success.
This is more or less what my table looks like (I'm showing only the columns I'm using). Shown here is the data for the first user, who has 4 geotagged images.
> subtest
   V1      V2         V3            V4                    V5        V6       V7 V8
1:  1  155229 9468411072 100004812@N06 2006-03-19 13:11:37.0 -2.224868 52.20397 16
2:  2  862398 9468409452 100004812@N06 2006-03-19 13:11:49.0 -2.224825 52.20399 16
3:  3 7931625 9465604241 100004812@N06 2006-03-19 15:12:23.0 -2.224890 52.20391 16
4:  4 7924096 9465627119 100004812@N06 2006-03-19 15:12:49.0 -2.224868 52.20397 16
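To make the sample reproducible, the four rows above can be rebuilt by hand. This is a sketch only: storing V3 and V5 as character is an assumption made here for illustration (V3 is too large for a 32-bit integer), and `centers`, `mlon` and `mlat` are hypothetical names.

```r
library(data.table)

# The four sample rows shown above, typed in by hand.
# Assumption: V3 (image ID) and V5 (timestamp) are kept as character.
subtest <- data.table(
  V1 = 1:4,
  V2 = c(155229, 862398, 7931625, 7924096),
  V3 = c("9468411072", "9468409452", "9465604241", "9465627119"),
  V4 = "100004812@N06",
  V5 = c("2006-03-19 13:11:37.0", "2006-03-19 13:11:49.0",
         "2006-03-19 15:12:23.0", "2006-03-19 15:12:49.0"),
  V6 = c(-2.224868, -2.224825, -2.224890, -2.224868),  # longitude
  V7 = c(52.20397, 52.20399, 52.20391, 52.20397),      # latitude
  V8 = 16L
)

# Mean center per user: one row per userID with mean longitude and latitude.
centers <- subtest[, .(mlon = mean(V6), mlat = mean(V7)), by = V4]
```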
And my code:
library(data.table)
library(fossil)
library(geosphere)
setwd("E:/MassiveDatasets/LargeDataset")
yahoo2 <- fread("LD.csv", sep = ",", header = FALSE, colClasses="numeric")
a <- yahoo2
mlong <- a[, lapply(.SD, mean), by=V4, .SDcols = 6]
mlat <- a[, lapply(.SD, mean), by=V4, .SDcols = 7]
rad <- pi/180
b1 <- (mlat[,V7] * rad)
b2 <- (mlong[,V6] * rad)
Dist <- function(v) {
  for (i in unique(a[, V3])) {
    a1 <- a[, V7] * rad
    a2 <- a[, V6] * rad
    dlon <- b2 - a2
    dlat <- b1 - a1
    GC <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
    c <- 2 * atan2(sqrt(GC), sqrt(1 - GC))
    R <- 6371.0087714  # WGS84 mean radius
    d <- R * c
    return(d)
  }
}
rgyr <- a[, lapply(.SD, Dist), by=V4]
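What I am trying to achieve could perhaps be expressed by doing the whole calculation inside the `by = V4` grouping, so that each user's rows are only ever compared with that user's own mean center. The following is a minimal sketch under stated assumptions, not a tested solution for the full 49 million rows: the table `a` is a toy stand-in with two made-up users, and `dist_to_center` and `dist_km` are hypothetical names.

```r
library(data.table)

# Toy stand-in for the real table: two made-up users,
# longitude in V6 and latitude in V7 (decimal degrees).
a <- data.table(
  V4 = c("u1", "u1", "u2", "u2"),
  V6 = c(0, 2, 10, 10),
  V7 = c(0, 0, 50, 52)
)

rad <- pi / 180
R   <- 6371.0087714  # mean Earth radius in km, same value as in the code above

# Haversine distance (km) from each point to one center (lon0, lat0), in degrees.
dist_to_center <- function(lon, lat, lon0, lat0) {
  dlon <- (lon0 - lon) * rad
  dlat <- (lat0 - lat) * rad
  h <- sin(dlat / 2)^2 + cos(lat * rad) * cos(lat0 * rad) * sin(dlon / 2)^2
  2 * R * atan2(sqrt(h), sqrt(1 - h))
}

# Inside j, V6 and V7 contain only the current user's rows, so mean() and the
# distance are evaluated per user; := adds the result as a column by reference.
a[, dist_km := dist_to_center(V6, V7, mean(V6), mean(V7)), by = V4]
```

In this form the function never refers to the full table or to the full vectors of means, which is where my version above seems to go wrong.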
Thank you very much in advance for your answers!