intermediate/advanced filtering in dplyr across groups

88 views Asked by At

I offered to help a friend with a problem, and am quickly realizing it is beyond my skills. I am interested in filtering to delete records of a group which fall on or after the first record of another group.

I am grouping by 'species', 'year', and 'sex', and would like to remove any records where 'sex' is "f" which occur after the first 'observation_doy' of the "m". In this example, the records I would like to delete are indicated in bold.

library(tidyverse)
library(janitor)

Original

    species               generations  year  observation_doy sex
1   Linnaea borealis       partially bimodal    2009    165 f
2   Linnaea borealis       partially bimodal    2010    150 f
3   Linnaea borealis       partially bimodal    2010    155 f
4   Linnaea borealis       partially bimodal    2010    160 m
**5 Linnaea borealis       partially bimodal    2010    160 f**
6   helianthus deserticola  partially bimodal   2009    174 f
7   helianthus deserticola  partially bimodal   2009    174 f
8   helianthus deserticola  partially bimodal   2009    180 m
**9 helianthus deserticola  partially bimodal   2009    180 f**
10  helianthus deserticola  partially bimodal   2009    184 m
11  helianthus deserticola  partially bimodal   2010    174 f
12  helianthus deserticola  partially bimodal   2010    174 f
**13    helianthus deserticola  partially bimodal   2010    180 f**
14  helianthus deserticola  partially bimodal   2010    180 m
15  helianthus deserticola  partially bimodal   2010    184 m
16  helianthus deserticola  partially bimodal   2011    174 f
17  helianthus deserticola  partially bimodal   2011    174 f
18  helianthus deserticola  partially bimodal   2011    180 f
19  helianthus deserticola  partially bimodal   2011    180 m
**20    helianthus deserticola  partially bimodal   2011    184 f
21  helianthus deserticola  partially bimodal   2011    184 f**
22  helianthus bolanderi    partially bimodal   2009    174 f
23  helianthus bolanderi    partially bimodal   2009    174 f
24  helianthus bolanderi    partially bimodal   2009    180 m
**25    helianthus bolanderi    partially bimodal   2009    180 f**
26  helianthus bolanderi    partially bimodal   2009    184 m

desired outcome:

    
  species                generations       year observation_doy sex 
1   Linneae borealis        partially bimodal   2009    165 f
2   Linneae borealis        partially bimodal   2010    150 f
3   Linneae borealis        partially bimodal   2010    155 f
4   Linneae borealis       partially bimodal    2010    160 m
5   helianthus deserticola  partially bimodal   2009    174 f
6   helianthus deserticola  partially bimodal   2009    174 f
7   helianthus deserticola  partially bimodal   2009    180 m
8   helianthus deserticola  partially bimodal   2009    180 f
9   helianthus deserticola  partially bimodal   2009    184 m
10  helianthus deserticola  partially bimodal   2010    174 f
11  helianthus deserticola  partially bimodal   2010    174 f
12  helianthus deserticola  partially bimodal   2010    180 m
13  helianthus deserticola  partially bimodal   2010    184 m
14  helianthus deserticola  partially bimodal   2011    174 f
15  helianthus deserticola  partially bimodal   2011    174 f
16  helianthus deserticola  partially bimodal   2011    180 m
17  helianthus bolanderi    partially bimodal   2009    174 f
18  helianthus bolanderi    partially bimodal   2009    174 f
19  helianthus bolanderi    partially bimodal   2009    180 m
20  helianthus bolanderi    partially bimodal   2009    184 m

The original dataset has roughly 10k records of 10 columns, so it is quite manageable. However I do not seem to have a good way to go about this problem. Below are a number of approaches I have tried- and why I suspect they failed; again I suspect these are invalid approaches.

Finding the first male emergence date by group, filter, and slice is quite easy. I suspected that I would be able to create a string of number after this and use a %notin% (negate %in%) to remove female records which follow the date. But I am unaware of how to limit the %notin% to that subset, unless filtering out the dataset. However, this approach seems poor and would create making multiple intermediates.

first_male_emergence <- df %>% dplyr::filter(generations == 'partially bimodal') %>% 
  dplyr::group_by(species, year, sex) %>% 
  dplyr::filter(sex == 'm') %>% 
  dplyr::slice_min(sampling_doy, n=1)

I have also tried to create a 'two sided' filter, which seems out of line with dplyr philosophy. I think the issue with this approach is getting dplyr to recognize the main filtering criteria, in this instance the '<=' sign.

clean_df <- raw_df %>%  
  group_by(species, year, sex) %>%
  dplyr::filter(sex == 'f' & sampling_doy <= print(filter(sex == 'm',(slice_min(sampling_doy, n = 1)))))

Note that I have 'print' on the expression to the right, in an attempt to return the numeric value for the <= operator to evaluate, which does not work. I tried multiple ways to print this value so that the operator was not reacting to the verb filter, i.e. variants of {.} but to no avail (I suspect I was doing this wrong). I played around with nesting each side of the statement in () to nest, but to no avail.

Finally I have tried using case_when, however there is also an issue with allowing the function to determing where the LHS and RHS are due to many operators.

clean_df <- raw_df %>% 
  group_by(species, year, sex) %>% 
  mutate(sampling_doy_1 = case_when(
    (sex == 'f' & sampling_doy <=
       filter(sex == 'm',(slice_max(sampling_doy, n = 1)) ~ sampling_doy, 
    (sex == 'f' & sampling_doy >= 
       filter(sex == 'm',(slice_max(sampling_doy, n = 1)) ~ NA,
  ))))))

I also tried a variant on this:

clean_df <- raw_df %>% 
  dplyr::filter(generations == 'partially bivoltine') %>% 
  group_by(species, year, sex) %>% 
  mutate(sampling_doy_1 = case_when(
  sex == 'f' & sampling_doy < sex == 'm', sampling_doy ~ sampling_doy,
  sex == 'f' & sampling_doy > sex == 'm', sampling_doy ~ NA,
))

I have also considered using arrange within the group by, and trying to slice all female records before the first male record. However, that does not seem like a good approach.

So my first question is: Can anyone solve this problem? It seems to me that using case_when and turning the left hand expression into a function which is called within case_when is the best course of action. However, at the same time I feel like that is fundamentally not how case_when is set up to be used based on other examples I have found. I sometimes finangle some very simple maths into it, but usually quite simple maths which acts the same across the columns of interest.

the second question is: is there a generally suggestable approach for this topic, or do I have to rely on writing functions for something like this?

Apologies for the length of this post, but any help is very appreciated. I have also tagged datatable as it seems it may have a nice simple solution.

1

There are 1 answers

0
bcarlsen On BEST ANSWER

This can be acomplished with a join.

male_first_obs <- my_data %>%
  group_by(species, year) %>%
  filter(sex == "m") %>%
  summarize(male_first_obs_doy = min(observation_doy))
 
my_data %>%
  left_join(male_first_obs, by = c(species, year)) %>%
  group_by(species, year) %>%
  filter(!(sex == "f" & observation_day > male_first_obs_doy)) %>%
  select(-male_first_obs_doy)

If you don't care about dbplyr compatibility, etc you can probably get even more concise, e.g.:

my_data %>%
  group_by(species, year) %>%
  mutate(male_first_obs_day = min(observation_doy[which(sex == "m")])) %>%
  filter(!(sex == "f" & observation_day > male_first_obs_doy)) %>%
  select(-male_first_obs_doy)