r remove outliers from a list of data.frames and make a new list of data.frames?

472 views Asked by At

I have a List of 6 in a data.frame

It has 3 columns:

id, T_C, Sales

T_C is TEST or CONTROL

Someone helped me here and I learned how to find the mean() and sd() by looping, instead of doing individual statements.

Now my goal is to remove the outliers from the 6 lists and produce a List of 6 (after removing outliers).

str(dfList) # this is the list of 6 in data.frames

I am able to get the mean() and sd() of each list like this:

list_mean_sd <- lapply(dfList,
                       function(df) 
                        {
                         df %>%
                           group_by(TC_INDICATOR) %>%
                           summarise(mean = mean(NET_SPEND),
                                     sd = sd(NET_SPEND))
                        })

> str(list_mean_sd)
List of 6  (1 obs. of  2 variables:)

I can selected them individually for mean or sd:

sapply(list_mean_sd, "[", "mean")
sapply(list_mean_sd, "[", "sd")

Basically, my goal is to id the outliers and remove them, product an alternative set, or after-set.

**outliers are:  mean - 3*sd()  or  mean + 3*sd()

I have this done, but with more manually steps, looking to learn how to loop through these sets and stuff like that, thanks in advance for helping me!

1

There are 1 answers

7
Nick Criswell On BEST ANSWER

Give this a shot. First I create data which I split into six data frames which are housed in a list.

set.seed(0)
test_data <- data.frame(id = 1:10000, 
                        T_C = sample(c(TRUE, FALSE), size = 10000, replace = TRUE),
                        Sales = rnorm(n = 10000),
                        grp = sample(c("a", "b", "c", "d", "e", "f"), 
                                     size = 10000, replace = TRUE))

test_split <- split(test_data, test_data$grp)

Then, I use lapply on this list to identify what I'm calling the z_scores which are computed as the difference between the mean of Sales and each individual Sales divided by the sd of Sales. Finally, we use filter on these to pull out the ones which have a z_score with an absolute value over 3.

library(dplyr)
outlier_list <- lapply(test_split, 
       function(m) group_by(m, T_C) %>% mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
         ungroup() %>% filter(abs(z_score) >= 3)
)

> outlier_list
$a
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   468  TRUE -2.995332      a -3.073314
2  3026  TRUE  3.028495      a  3.075258
3  5188  TRUE -3.097847      a -3.177952
4  7993 FALSE -3.571076      a -3.823983
5  9105  TRUE -3.216710      a -3.299276

$b
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   264  TRUE  3.003494      b  3.003329
2  2172  TRUE  3.001475      b  3.001326
3  2980 FALSE -3.176356      b -3.222782
4  3366 FALSE  3.009292      b  3.048559
5  7477 FALSE  3.348301      b  3.392265
6  7583  TRUE -3.089758      b -3.040911

$c
# A tibble: 2 × 5
     id   T_C    Sales    grp  z_score
  <int> <lgl>    <dbl> <fctr>    <dbl>
1  8078  TRUE 3.015343      c 3.129923
2  8991 FALSE 3.113526      c 3.058302

$d
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   544  TRUE  3.289070      d  3.168235
2  3791 FALSE  3.791938      d  3.769810
3  6771 FALSE -3.157741      d -3.166861
4  7864  TRUE  3.164128      d  3.045728
5  9371  TRUE -3.026884      d -3.024655

$e
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   186 FALSE  3.021541      e  3.046079
2  1211  TRUE  3.414337      e  3.343521
3  1665  TRUE  3.546282      e  3.473614
4  3765 FALSE  3.363641      e  3.391142
5  4172  TRUE  3.348820      e  3.278923
6  7973 FALSE -2.987790      e -3.015284

$f
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1  1089  TRUE -3.195090      f -3.189979
2  2452 FALSE  3.287591      f  3.212317
3  3486 FALSE -3.334942      f -3.367962
4  4198 FALSE -3.102578      f -3.137082
5  8183  TRUE  3.081077      f  3.075324
6  8656  TRUE  3.253873      f  3.247822

Obviously, this will give you only the outliers. If you want to keep only the inliers, you change the >= 3 to a < 3.

Updated to get Wilcox test on inliers

inlier_list <- lapply(test_split, 
                       function(m) group_by(m, T_C) %>% 
                        mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
                         ungroup() %>% filter(abs(z_score) < 3)
)

We just run lapply on the list of inliers using the parameters noted in OP's comment.

wilcox_test_res <- lapply(inlier_list, 
                          function(m) wilcox.test(m$Sales ~ m$T_C, 
                                                  mu= mean(m$Sales[m$T_C == TRUE]), 
                                                  conf.level=0.95,