r remove outliers from a list of data.frames and make a new list of data.frames?

Question

r remove outliers from a list of data.frames and make a new list of data.frames?

469 views Asked by Ray Kodiak At 22 December 2016 at 22:57

I have a List of 6 in a data.frame

It has 3 columns:

id, T_C, Sales

T_C is TEST or CONTROL

Someone helped me here and I learned how to find the mean() and sd() by looping, instead of doing individual statements.

Now my goal is to remove the outliers from the 6 lists and produce a List of 6 (after removing outliers).

str(dfList) # this is the list of 6 in data.frames

I am able to get the mean() and sd() of each list like this:

list_mean_sd <- lapply(dfList,
                       function(df) 
                        {
                         df %>%
                           group_by(TC_INDICATOR) %>%
                           summarise(mean = mean(NET_SPEND),
                                     sd = sd(NET_SPEND))
                        })

> str(list_mean_sd)
List of 6  (1 obs. of  2 variables:)

I can selected them individually for mean or sd:

sapply(list_mean_sd, "[", "mean")
sapply(list_mean_sd, "[", "sd")

Basically, my goal is to id the outliers and remove them, product an alternative set, or after-set.

**outliers are:  mean - 3*sd()  or  mean + 3*sd()

I have this done, but with more manually steps, looking to learn how to loop through these sets and stuff like that, thanks in advance for helping me!

Original Q&A

There are 1 answers

**Nick Criswell** · Accepted Answer · 2016-12-22T23:54:01+00:00

Give this a shot. First I create data which I split into six data frames which are housed in a list.

set.seed(0)
test_data <- data.frame(id = 1:10000, 
                        T_C = sample(c(TRUE, FALSE), size = 10000, replace = TRUE),
                        Sales = rnorm(n = 10000),
                        grp = sample(c("a", "b", "c", "d", "e", "f"), 
                                     size = 10000, replace = TRUE))

test_split <- split(test_data, test_data$grp)

Then, I use lapply on this list to identify what I'm calling the z_scores which are computed as the difference between the mean of Sales and each individual Sales divided by the sd of Sales. Finally, we use filter on these to pull out the ones which have a z_score with an absolute value over 3.

library(dplyr)
outlier_list <- lapply(test_split, 
       function(m) group_by(m, T_C) %>% mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
         ungroup() %>% filter(abs(z_score) >= 3)
)

> outlier_list
$a
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   468  TRUE -2.995332      a -3.073314
2  3026  TRUE  3.028495      a  3.075258
3  5188  TRUE -3.097847      a -3.177952
4  7993 FALSE -3.571076      a -3.823983
5  9105  TRUE -3.216710      a -3.299276

$b
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   264  TRUE  3.003494      b  3.003329
2  2172  TRUE  3.001475      b  3.001326
3  2980 FALSE -3.176356      b -3.222782
4  3366 FALSE  3.009292      b  3.048559
5  7477 FALSE  3.348301      b  3.392265
6  7583  TRUE -3.089758      b -3.040911

$c
# A tibble: 2 × 5
     id   T_C    Sales    grp  z_score
  <int> <lgl>    <dbl> <fctr>    <dbl>
1  8078  TRUE 3.015343      c 3.129923
2  8991 FALSE 3.113526      c 3.058302

$d
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   544  TRUE  3.289070      d  3.168235
2  3791 FALSE  3.791938      d  3.769810
3  6771 FALSE -3.157741      d -3.166861
4  7864  TRUE  3.164128      d  3.045728
5  9371  TRUE -3.026884      d -3.024655

$e
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   186 FALSE  3.021541      e  3.046079
2  1211  TRUE  3.414337      e  3.343521
3  1665  TRUE  3.546282      e  3.473614
4  3765 FALSE  3.363641      e  3.391142
5  4172  TRUE  3.348820      e  3.278923
6  7973 FALSE -2.987790      e -3.015284

$f
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1  1089  TRUE -3.195090      f -3.189979
2  2452 FALSE  3.287591      f  3.212317
3  3486 FALSE -3.334942      f -3.367962
4  4198 FALSE -3.102578      f -3.137082
5  8183  TRUE  3.081077      f  3.075324
6  8656  TRUE  3.253873      f  3.247822

Obviously, this will give you only the outliers. If you want to keep only the inliers, you change the >= 3 to a < 3.

Updated to get Wilcox test on inliers

inlier_list <- lapply(test_split, 
                       function(m) group_by(m, T_C) %>% 
                        mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
                         ungroup() %>% filter(abs(z_score) < 3)
)

We just run lapply on the list of inliers using the parameters noted in OP's comment.

wilcox_test_res <- lapply(inlier_list, 
                          function(m) wilcox.test(m$Sales ~ m$T_C, 
                                                  mu= mean(m$Sales[m$T_C == TRUE]), 
                                                  conf.level=0.95,

TechQA.

r remove outliers from a list of data.frames and make a new list of data.frames?

There are 1 answers

Updated to get Wilcox test on inliers

Related Questions in R

Related Questions in FUNCTION

Related Questions in FUNCTIONAL-PROGRAMMING

Related Questions in OUTLIERS

Related Questions in IDENTIFY

Popular Questions

Popular Tags

Trending Questions