How to deal with outliers within and between observations in a panel data in R?

601 views Asked by At

I have a dataset that shows the revenue over 20 years of around 100.000 companies. The data has many other variables, but, below, I'm writing a reproducible version of a simplified sample of this dataset.

my_data <- data.frame(Company = c("A","B","C","D"), CITY = c("Paris", "Paris", "Quimper", "Nice"), year_creation = c("2010", "2009", "2008", "2009"), revenue_2008 = c(NA, NA, 10, NA), 
     revenue_2009 = c(NA,10, 20, 15000), revenue_2010 = c(02, 10, 2500, 20000), revenue_2011 = c(14, 16, 10, 30000),
     size = c(2, 3, 5, 1))

As you can see, I'm dealing with an unbalanced panel data that has outliers both within the observations (e.g., the sudden revenue of company C in the year 2010) and in between the observations (e.g., the company D that has much higher revenues than the others, even considering I've selected companies that were supposed to be similar)...

So, my question is, what is the best way to deal with these two types of outliers in R? I imagined that for the within outliers, the data in the wide-format should be better, right? But which code can run to check the outliers line by line (i.e., observation by observation)? And for the second type of outliers? Is it better to convert the data for the long format? If yes, how could I test the outliers in the long format?

Thank you so much for your help! Best,

1

There are 1 answers

0
Jav On BEST ANSWER

How to detect is mostly statistical question. One way you could use is Hampel filter (its pros and cons are not in the scope of this answer).

It considers values outside of median ± 3*(median absolute deviation) to be outliers.

Let's imagine that we will use this criteria. You could do within and between tests through by argument of data.table.

Is it better to convert the data for the long format?

It would make analysis easier, hence I have converted it via melt

my_data <- data.frame(Company = c("A","B","C","D"), CITY = c("Paris", "Paris", "Quimper", "Nice"), year_creation = c("2010", "2009", "2008", "2009"), revenue_2008 = c(NA, NA, 10, NA), 
                      revenue_2009 = c(NA,10, 20, 15000), revenue_2010 = c(02, 10, 2500, 20000), revenue_2011 = c(14, 16, 10, 30000),
                      size = c(2, 3, 5, 1))

library(data.table)

my_data <- as.data.table(my_data)

my_data <- melt(my_data, id.vars = c("Company", "CITY", "year_creation", "size"))

hampel_filter <- function(x){
  x_med <- median(x, na.rm = TRUE)
  x_mad <- mad(x, na.rm = TRUE)
  (x > x_med + 3*x_mad | x < x_med - 3*x_mad)
}

my_data[, between_out := hampel_filter(value), by = variable]
my_data[, within_out := hampel_filter(value), by = Company]
> my_data
    Company    CITY year_creation size     variable value between_out within_out
 1:       A   Paris          2010    2 revenue_2008    NA          NA         NA
 2:       B   Paris          2009    3 revenue_2008    NA          NA         NA
 3:       C Quimper          2008    5 revenue_2008    10       FALSE      FALSE
 4:       D    Nice          2009    1 revenue_2008    NA          NA         NA
 5:       A   Paris          2010    2 revenue_2009    NA          NA         NA
 6:       B   Paris          2009    3 revenue_2009    10       FALSE      FALSE
 7:       C Quimper          2008    5 revenue_2009    20       FALSE      FALSE
 8:       D    Nice          2009    1 revenue_2009 15000        TRUE      FALSE
 9:       A   Paris          2010    2 revenue_2010     2       FALSE      FALSE
10:       B   Paris          2009    3 revenue_2010    10       FALSE      FALSE
11:       C Quimper          2008    5 revenue_2010  2500       FALSE       TRUE
12:       D    Nice          2009    1 revenue_2010 20000        TRUE      FALSE
13:       A   Paris          2010    2 revenue_2011    14       FALSE      FALSE
14:       B   Paris          2009    3 revenue_2011    16       FALSE       TRUE
15:       C Quimper          2008    5 revenue_2011    10       FALSE      FALSE
16:       D    Nice          2009    1 revenue_2011 30000        TRUE      FALSE

You could also detect and treat outliers at the same time with Winsorize() from DescTools. See details: https://en.wikipedia.org/wiki/Winsorizing