I'm trying to get some input on how I might speed up a for loop that I've written. Essentially, I have a dataframe (DF1
) where each row provides the latitude and longitude/point at a given time. The variables therefore are the lat and long for the point and the timestamp (date and time object). In addition, each row is a unique timestamp, for a unique point (in other word no repeats).
I'm trying to match it up to weather data which is contained in an netcdf file. The hourly weather data I have is a 3 dimensional file that includes the lats and longs for grids, time stamps, and the value for that weather variable. I'll call the weather variable u
for now.
What I want to end up with: I have a column for the u
values in DF1
. It starts out with only missing values. In the end I want to replace those missing values in DF1
with the appropriate u
value from the netcdf file.
How I've done this so far: I have constructed a for loop that can extract the appropriate u
values for the nearest grid to each point in DF1
. For example, for lat x and long y in DF1
I find the nearest grid in the netcdf file and extract the full timeseries. Then, within the for loop, I amend the DF1$u
values with the appropriate data. DF1
is a rather large dataframe and the netcdf is even bigger (DF1
is just a small subset of the full netcdf).
Some psuedo data:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/5/2021 04:00:00"
, '2/1/2021 06:00:00'
, "1/7/2021 01:00:00"
, "2/2/2021 01:00:00"
, "2/5/2021 02:00:00")
lat <- c(34,36,41,50,20,40)
long <- c(55,50,‑89,-175,-155,25)
DF1 <- data.frame(ID, datetime, lat, long)
DF1$u <- NA
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 NA
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
Here is an example of the type of for loop I've constructed, I've left out some of the more specific details that aren't relevant:
for(i in 1:nrows(DF1) {
### a number of steps here that identify the closest grid to each point. ####
mat_index <- as.vector(arrayInd(which.min(dist_mat), dim(dist_mat)))
# Extract the time series for the lat and long that are closest. u is the weather variable, time is the datetime variable, and geometry is the corresponding lat long list item.
df_u <- data.frame(u=data_u[mat_index[2],mat_index[1],],time=timestamp,geometry=rep(psj[i,1],dim(data_u)[3]))
# To make things easier I seperate geometry into a lat and a long col
df_u <- tidyr::separate(df_u, geometry, into = c("long", "lat"), sep = ",")
df_u$long <- gsub("c\\(", "", df_u$long)
df_u$lat <- gsub("\\)", "", df_u$lat)
# I find that datatables are a bit easier for these types of tasks, so I set the full timeseries data and DF1 to data table (in reality I set df1 as a data table outside of the for loop)
df_u <- setDT(df_u)
# Then I use merge to join the two datatables, replace the missing value with the appropriate value from df_u, and then drop all the unnecessary columns in the final step.
df_u_new <- merge(windu_df, df_u, by =c("lat","long","time"), all.x = T)
df_u_new[, u := ifelse(is.na(u.x), u.y, u.x)]
windu_df <- df_u_new[, .(time, lat, long, u)]
}
This works, but given the sheer size of the dataframe/datatables that I'm working with, I wonder if there is a faster way to do that last step in particular. I know merge is probably the slowest way to do this, but I kept running into issues using match()
and inner_join
.
Unfortunately I'm not able to really give fully reproduceable data here given that I'm working with a netcdf file, but df_u looks something like this for the first iteration:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/1/2021 02:00:00"
, "1/1/2021 03:00:00"
, "1/1/2021 04:00:00"
, "1/1/2021 05:00:00"
, "1/1/2021 06:00:00")
lat <- c(34,34,34,34,34,34)
long <- c(55,55,55,55,55,55)
u <- c(2.8,3.6,2.1,5.6,4.4,2,5)
df_u <- data.frame(ID, datetime, lat, long,u)
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/1/2021 02:00:00 34 55 3.6
3 3 1/1/2021 03:00:00 34 55 2.1
4 4 1/1/2021 04:00:00 34 55 5.6
5 5 1/1/2021 05:00:00 34 55 4.4
6 6 1/1/2021 06:00:00 34 55 2.5
Once u is amended in DF1 it should read:
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
In the next iteration, the for loop will extract the relevant weather data for the lat and long in row 2 and then retain 2.8 and replace the NA in row 2 with a value.
EDIT: The NETCDF data covers an entire year (so every hour for a year) for a decently large spatial area. It's ERA5 data. In addition, DF1 had thousands of unique lat/long and timestamp observations.