R dplyr Summarising based condition

1.5k views Asked by At

I have a data set of items downloaded from a website based on reports we generate. The idea is to remove reports that are no longer needed based on the number of downloads. The logic is basically count all the reports for the last year that have been downloaded, check if they are outside of two absolute deviations around the median for the current year, check if the report has been downloaded within the last 4 weeks and if so how many times

I have the code below which doesn't work, I was wondering if anyone can help It gives me the error: for the n_recent_downloads section

Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables

reports <- c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D")
Week_no <- c(36,36,33,32,20,18,36,30,29,27)

New.Downloads <- data.frame (Report1 = reports, DL.Week =  Week_no)


test <- New.Downloads %>%
  group_by(report1) %>%
  summarise(n_downloads = n(),
        n_recent_downloads = ifelse(sum((as.integer(DL.Week) >= (as.integer(max(DL.Week))) - 4),value,0)))
1

There are 1 answers

0
Matt Upson On BEST ANSWER

Providing a reproducible example would make life a lot easier. Nonetheless I have modified your code to do what I think you were trying to achieve.

I've split it into two so you can see what is going on. I moved the ifelsestatement to a mutate call which gives:

library(dplyr)

New.Downloads <- data.frame(
  Report1 = c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D"), 
  DL.Week = as.numeric(c(36,36,33,32,20,18,36,30,29,27))
)

test <- New.Downloads %>%
  group_by(Report1) %>%
  mutate(
    median = median(DL.Week),
    mad = 2 * mad(DL.Week),
    check = ifelse(DL.Week > median + mad | DL.Week < median - mad, 0, DL.Week)
  ) 

test

Source: local data frame [10 x 5]
Groups: Report1

    Report1 DL.Week median     mad check
1  Report_A      36   28.0 23.7216    36
2  Report_B      36   36.0  0.0000    36
3  Report_C      33   33.0  0.0000    33
4  Report_D      32   29.5  4.4478    32
5  Report_A      20   28.0 23.7216    20
6  Report_A      18   28.0 23.7216    18
7  Report_A      36   28.0 23.7216    36
8  Report_D      30   29.5  4.4478    30
9  Report_D      29   29.5  4.4478    29
10 Report_D      27   29.5  4.4478    27

Note that from your example none of the values are classed as extreme relative to the median + 2 * mad criterion, so the check values are identical to DL.week.

You can then chain a summarise onto the end of this to give you the sums.

test %>%
  summarise(
    n_recent_downloads = sum(check)
  )

Source: local data frame [4 x 2]

   Report1 n_recent_downloads
1 Report_A                110
2 Report_B                 36
3 Report_C                 33
4 Report_D                118