I would really appreciate some assistance with sorting some data to reduce temporal auto-correlation for a massive dataset that I have (almost 2600 rows). I'm sure this is probably quite an easy thing to do with something like tidyverse but I'm still very new to data cleaning using R and need to learn.
It's a very large table with entomological data... insect trap count, longitude, latitude, date, and a load of environmental variables (some numeric, some categorical). A hypothetical (much smaller) example would be:
df <- data.frame(
insects = round(runif(20)*100),
long = c(rep(22.4,5), rep(18.4,5), rep(21.8,5), rep(20.1,5)),
lat = c(rep(-34.1,5), rep(-34.4,5), rep(-33.8,5), rep(-33.7,5)),
land = c(rep("AG8",5), rep("GA6",5), rep("HE1",5), rep("JA2",5)),
temp = round(runif(20)*15),
NDVI = c(rep(c(48032730,48493073,48449380,48423773,48420000),4)),
precip = round(runif(20)*15),
date = lubridate::as_date(rep(c("2006-01-15", "2006-01-25", "2006-01-04", "2007-02-15", "2007-02-18", "2006-01-15", "2008-02-20", "2008-02-01", "2009-04-08", "2009-04-19"),2)))
I'd like to create a new dataframe, whereby every location (long/lat) which is the same, that has multiple counts in a calendar month, is all merged into one row, whereby the insect count and the environmental variables are all averaged for that month (the categorical values should be the same for every location.
So for this dataframe example above, instead of 20 rows I would end up with: 2 rows for first location (jan 2006 and feb 2007) 3 rows for second location (jan 2006, feb 2008, apr 2009) 2 rows for third location (jan 2006 and feb 2007) 3 rows for fourth location (jan 2006, feb 2008, apr 2009)
Thanks for your help
First, create a new column for month and year.
Next, group by location (long, lat) and month_year, and calculate the averages: