I offered to help a friend with a problem, and am quickly realizing it is beyond my skills. I am interested in filtering to delete records of a group which fall on or after the first record of another group.
I am grouping by 'species', 'year', and 'sex', and would like to remove any records where 'sex' is "f" which occur after the first 'observation_doy' of the "m". In this example, the records I would like to delete are indicated in bold.
library(tidyverse)
library(janitor)
Original
species generations year observation_doy sex
1 Linnaea borealis partially bimodal 2009 165 f
2 Linnaea borealis partially bimodal 2010 150 f
3 Linnaea borealis partially bimodal 2010 155 f
4 Linnaea borealis partially bimodal 2010 160 m
**5 Linnaea borealis partially bimodal 2010 160 f**
6 helianthus deserticola partially bimodal 2009 174 f
7 helianthus deserticola partially bimodal 2009 174 f
8 helianthus deserticola partially bimodal 2009 180 m
**9 helianthus deserticola partially bimodal 2009 180 f**
10 helianthus deserticola partially bimodal 2009 184 m
11 helianthus deserticola partially bimodal 2010 174 f
12 helianthus deserticola partially bimodal 2010 174 f
**13 helianthus deserticola partially bimodal 2010 180 f**
14 helianthus deserticola partially bimodal 2010 180 m
15 helianthus deserticola partially bimodal 2010 184 m
16 helianthus deserticola partially bimodal 2011 174 f
17 helianthus deserticola partially bimodal 2011 174 f
18 helianthus deserticola partially bimodal 2011 180 f
19 helianthus deserticola partially bimodal 2011 180 m
**20 helianthus deserticola partially bimodal 2011 184 f
21 helianthus deserticola partially bimodal 2011 184 f**
22 helianthus bolanderi partially bimodal 2009 174 f
23 helianthus bolanderi partially bimodal 2009 174 f
24 helianthus bolanderi partially bimodal 2009 180 m
**25 helianthus bolanderi partially bimodal 2009 180 f**
26 helianthus bolanderi partially bimodal 2009 184 m
desired outcome:
species generations year observation_doy sex
1 Linneae borealis partially bimodal 2009 165 f
2 Linneae borealis partially bimodal 2010 150 f
3 Linneae borealis partially bimodal 2010 155 f
4 Linneae borealis partially bimodal 2010 160 m
5 helianthus deserticola partially bimodal 2009 174 f
6 helianthus deserticola partially bimodal 2009 174 f
7 helianthus deserticola partially bimodal 2009 180 m
8 helianthus deserticola partially bimodal 2009 180 f
9 helianthus deserticola partially bimodal 2009 184 m
10 helianthus deserticola partially bimodal 2010 174 f
11 helianthus deserticola partially bimodal 2010 174 f
12 helianthus deserticola partially bimodal 2010 180 m
13 helianthus deserticola partially bimodal 2010 184 m
14 helianthus deserticola partially bimodal 2011 174 f
15 helianthus deserticola partially bimodal 2011 174 f
16 helianthus deserticola partially bimodal 2011 180 m
17 helianthus bolanderi partially bimodal 2009 174 f
18 helianthus bolanderi partially bimodal 2009 174 f
19 helianthus bolanderi partially bimodal 2009 180 m
20 helianthus bolanderi partially bimodal 2009 184 m
The original dataset has roughly 10k records of 10 columns, so it is quite manageable. However I do not seem to have a good way to go about this problem. Below are a number of approaches I have tried- and why I suspect they failed; again I suspect these are invalid approaches.
Finding the first male emergence date by group, filter, and slice is quite easy. I suspected that I would be able to create a string of number after this and use a %notin% (negate %in%) to remove female records which follow the date. But I am unaware of how to limit the %notin% to that subset, unless filtering out the dataset. However, this approach seems poor and would create making multiple intermediates.
first_male_emergence <- df %>% dplyr::filter(generations == 'partially bimodal') %>%
dplyr::group_by(species, year, sex) %>%
dplyr::filter(sex == 'm') %>%
dplyr::slice_min(sampling_doy, n=1)
I have also tried to create a 'two sided' filter, which seems out of line with dplyr philosophy. I think the issue with this approach is getting dplyr to recognize the main filtering criteria, in this instance the '<=' sign.
clean_df <- raw_df %>%
group_by(species, year, sex) %>%
dplyr::filter(sex == 'f' & sampling_doy <= print(filter(sex == 'm',(slice_min(sampling_doy, n = 1)))))
Note that I have 'print' on the expression to the right, in an attempt to return the numeric value for the <= operator to evaluate, which does not work. I tried multiple ways to print this value so that the operator was not reacting to the verb filter, i.e. variants of {.} but to no avail (I suspect I was doing this wrong). I played around with nesting each side of the statement in () to nest, but to no avail.
Finally I have tried using case_when, however there is also an issue with allowing the function to determing where the LHS and RHS are due to many operators.
clean_df <- raw_df %>%
group_by(species, year, sex) %>%
mutate(sampling_doy_1 = case_when(
(sex == 'f' & sampling_doy <=
filter(sex == 'm',(slice_max(sampling_doy, n = 1)) ~ sampling_doy,
(sex == 'f' & sampling_doy >=
filter(sex == 'm',(slice_max(sampling_doy, n = 1)) ~ NA,
))))))
I also tried a variant on this:
clean_df <- raw_df %>%
dplyr::filter(generations == 'partially bivoltine') %>%
group_by(species, year, sex) %>%
mutate(sampling_doy_1 = case_when(
sex == 'f' & sampling_doy < sex == 'm', sampling_doy ~ sampling_doy,
sex == 'f' & sampling_doy > sex == 'm', sampling_doy ~ NA,
))
I have also considered using arrange within the group by, and trying to slice all female records before the first male record. However, that does not seem like a good approach.
So my first question is: Can anyone solve this problem? It seems to me that using case_when and turning the left hand expression into a function which is called within case_when is the best course of action. However, at the same time I feel like that is fundamentally not how case_when is set up to be used based on other examples I have found. I sometimes finangle some very simple maths into it, but usually quite simple maths which acts the same across the columns of interest.
the second question is: is there a generally suggestable approach for this topic, or do I have to rely on writing functions for something like this?
Apologies for the length of this post, but any help is very appreciated. I have also tagged datatable as it seems it may have a nice simple solution.
This can be acomplished with a join.
If you don't care about
dbplyr
compatibility, etc you can probably get even more concise, e.g.: