Thoughts on how to speed up replacing a column value when conditions between two objects (dataframe or datatable) are met in a loop?

Question

Thoughts on how to speed up replacing a column value when conditions between two objects (dataframe or datatable) are met in a loop?

80 views Asked by Tham v At 05 January 2025 at 12:37

I'm trying to get some input on how I might speed up a for loop that I've written. Essentially, I have a dataframe (DF1) where each row provides the latitude and longitude/point at a given time. The variables therefore are the lat and long for the point and the timestamp (date and time object). In addition, each row is a unique timestamp, for a unique point (in other word no repeats).

I'm trying to match it up to weather data which is contained in an netcdf file. The hourly weather data I have is a 3 dimensional file that includes the lats and longs for grids, time stamps, and the value for that weather variable. I'll call the weather variable u for now.

What I want to end up with: I have a column for the u values in DF1. It starts out with only missing values. In the end I want to replace those missing values in DF1 with the appropriate u value from the netcdf file.

How I've done this so far: I have constructed a for loop that can extract the appropriate u values for the nearest grid to each point in DF1. For example, for lat x and long y in DF1 I find the nearest grid in the netcdf file and extract the full timeseries. Then, within the for loop, I amend the DF1$u values with the appropriate data. DF1 is a rather large dataframe and the netcdf is even bigger (DF1 is just a small subset of the full netcdf).

Some psuedo data:


ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021  01:00:00"
              , "1/5/2021 04:00:00"
              , '2/1/2021 06:00:00'
              , "1/7/2021 01:00:00"
              , "2/2/2021 01:00:00"
              , "2/5/2021 02:00:00")
lat <- c(34,36,41,50,20,40)
long <- c(55,50,‑89,-175,-155,25)
DF1 <- data.frame(ID, datetime, lat, long)
DF1$u <- NA
  ID           datetime lat long   u
1  1 1/1/2021  01:00:00  34   55  NA
2  2  1/5/2021 04:00:00  36   50  NA
3  3  2/1/2021 06:00:00  41  -89  NA
4  4  1/7/2021 01:00:00  50 -175  NA
5  5  2/2/2021 01:00:00  20 -155  NA
6  6  2/5/2021 02:00:00  40   25  NA

Here is an example of the type of for loop I've constructed, I've left out some of the more specific details that aren't relevant:

for(i in 1:nrows(DF1) {
  
  ### a number of steps here that identify the closest grid to each point. ####

  mat_index <- as.vector(arrayInd(which.min(dist_mat), dim(dist_mat)))

  # Extract the time series for the lat and long that are closest. u is the weather variable, time is the datetime variable, and geometry is the corresponding lat long list item.
  df_u <- data.frame(u=data_u[mat_index[2],mat_index[1],],time=timestamp,geometry=rep(psj[i,1],dim(data_u)[3]))
  
  # To make things easier I seperate geometry into a lat and a long col
  df_u <- tidyr::separate(df_u, geometry, into = c("long", "lat"), sep = ",")
  df_u$long <- gsub("c\\(", "", df_u$long)
  df_u$lat <- gsub("\\)", "", df_u$lat)
 
  # I find that datatables are a bit easier for these types of tasks, so I set the full timeseries data and DF1 to data table (in reality I set df1 as a data table outside of the for loop)
  df_u <- setDT(df_u)
  
  # Then I use merge to join the two datatables, replace the missing value with the appropriate value from df_u, and then drop all the unnecessary columns in the final step. 
  df_u_new <- merge(windu_df, df_u, by =c("lat","long","time"), all.x = T)
  df_u_new[, u := ifelse(is.na(u.x), u.y, u.x)]
  windu_df <- df_u_new[, .(time, lat, long, u)]  
  }

This works, but given the sheer size of the dataframe/datatables that I'm working with, I wonder if there is a faster way to do that last step in particular. I know merge is probably the slowest way to do this, but I kept running into issues using match() and inner_join.

Unfortunately I'm not able to really give fully reproduceable data here given that I'm working with a netcdf file, but df_u looks something like this for the first iteration:

ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021  01:00:00"
              , "1/1/2021  02:00:00"
              , "1/1/2021  03:00:00"
              , "1/1/2021  04:00:00"
              , "1/1/2021  05:00:00"
              , "1/1/2021  06:00:00")
lat <- c(34,34,34,34,34,34)
long <- c(55,55,55,55,55,55)
u <- c(2.8,3.6,2.1,5.6,4.4,2,5)
df_u <- data.frame(ID, datetime, lat, long,u)

  ID           datetime lat long   u
1  1 1/1/2021  01:00:00  34   55 2.8
2  2 1/1/2021  02:00:00  34   55 3.6
3  3 1/1/2021  03:00:00  34   55 2.1
4  4 1/1/2021  04:00:00  34   55 5.6
5  5 1/1/2021  05:00:00  34   55 4.4
6  6 1/1/2021  06:00:00  34   55 2.5

Once u is amended in DF1 it should read:

  ID           datetime lat long   u
1  1 1/1/2021  01:00:00  34   55  2.8
2  2  1/5/2021 04:00:00  36   50  NA
3  3  2/1/2021 06:00:00  41  -89  NA
4  4  1/7/2021 01:00:00  50 -175  NA
5  5  2/2/2021 01:00:00  20 -155  NA
6  6  2/5/2021 02:00:00  40   25  NA

In the next iteration, the for loop will extract the relevant weather data for the lat and long in row 2 and then retain 2.8 and replace the NA in row 2 with a value.

EDIT: The NETCDF data covers an entire year (so every hour for a year) for a decently large spatial area. It's ERA5 data. In addition, DF1 had thousands of unique lat/long and timestamp observations.

Original Q&A

TechQA.

Thoughts on how to speed up replacing a column value when conditions between two objects (dataframe or datatable) are met in a loop?

There are 0 answers

Related Questions in R

Related Questions in DATATABLES

Related Questions in ERA5

Popular Questions

Popular Tags

Trending Questions