I have a dataset that shows the revenue of around 100,000 companies over 20 years. The data has many other variables, but below is a reproducible, simplified sample of this dataset.
my_data <- data.frame(Company = c("A", "B", "C", "D"), CITY = c("Paris", "Paris", "Quimper", "Nice"),
                      year_creation = c("2010", "2009", "2008", "2009"),
                      revenue_2008 = c(NA, NA, 10, NA), revenue_2009 = c(NA, 10, 20, 15000),
                      revenue_2010 = c(2, 10, 2500, 20000), revenue_2011 = c(14, 16, 10, 30000),
                      size = c(2, 3, 5, 1))
As you can see, I'm dealing with an unbalanced panel dataset that has outliers both within the observations (e.g., the sudden jump in the revenue of company C in 2010) and between the observations (e.g., company D, which has much higher revenues than the others, even though I selected companies that were supposed to be similar).
So, my question is: what is the best way to deal with these two types of outliers in R? I imagine that for the within outliers the wide format should be better, right? But what code can I run to check the outliers line by line (i.e., observation by observation)? And for the second type of outliers, is it better to convert the data to the long format? If so, how could I test for outliers in the long format?
Thank you so much for your help! Best,
How to detect outliers is mostly a statistical question. One approach you could use is the Hampel filter (its pros and cons are outside the scope of this answer).
It considers values outside of
median ± 3*(median absolute deviation)
to be outliers. Let's suppose we use this criterion. You could run both the within and the between tests through the by argument of data.table. The long format makes the analysis easier, hence I have converted the data via melt.
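A minimal sketch of that workflow on the sample data from the question could look as follows. The helper is_hampel_outlier() and the column names year, revenue, within_outlier and between_outlier are illustrative choices, not part of any package. Note that R's mad() multiplies the raw median absolute deviation by 1.4826 by default; pass constant = 1 if you want the unscaled value.

```r
library(data.table)

my_data <- data.frame(
  Company = c("A", "B", "C", "D"),
  CITY = c("Paris", "Paris", "Quimper", "Nice"),
  year_creation = c("2010", "2009", "2008", "2009"),
  revenue_2008 = c(NA, NA, 10, NA),
  revenue_2009 = c(NA, 10, 20, 15000),
  revenue_2010 = c(2, 10, 2500, 20000),
  revenue_2011 = c(14, 16, 10, 30000),
  size = c(2, 3, 5, 1)
)

# Wide -> long: one row per company-year, dropping the missing years
long <- melt(
  as.data.table(my_data),
  id.vars = c("Company", "CITY", "year_creation", "size"),
  measure.vars = grep("^revenue_", names(my_data), value = TRUE),
  variable.name = "year",
  value.name = "revenue",
  na.rm = TRUE
)
long[, year := as.integer(sub("revenue_", "", as.character(year)))]

# Hypothetical helper: TRUE where x lies outside median +/- k * MAD
is_hampel_outlier <- function(x, k = 3) {
  med <- median(x, na.rm = TRUE)
  abs(x - med) > k * mad(x, center = med, na.rm = TRUE)
}

# Within test: each company compared with its own history
long[, within_outlier := is_hampel_outlier(revenue), by = Company]

# Between test: each company compared with the others in the same year
long[, between_outlier := is_hampel_outlier(revenue), by = year]

print(long[order(Company, year)])
```

With groups as small as these, the rule is fragile: when more than half of the values in a group are identical, the MAD is 0 and any value that differs from the median is flagged. On the full dataset the by-year groups will be large, but the by-company groups have at most 20 points, so the within flags deserve a manual look.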
You could also detect and treat outliers at the same time with Winsorize() from DescTools. See details: https://en.wikipedia.org/wiki/Winsorizing
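To illustrate what winsorizing does, here is a hand-rolled base R sketch applied to the 2011 revenues from the question. The helper winsorize_by_quantile() is hypothetical and only mimics the idea; the 5% and 95% cut-offs are an assumption chosen for the example. The arguments that control the cut-offs in DescTools::Winsorize() have differed between package versions, so check ?Winsorize for the release you have installed.

```r
# Revenues of companies A-D in 2011 (from the sample data)
revenue_2011 <- c(A = 14, B = 16, C = 10, D = 30000)

# Hypothetical helper: clip values below/above the chosen quantiles
# to those quantiles instead of removing them
winsorize_by_quantile <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

winsorized <- winsorize_by_quantile(revenue_2011)
print(winsorized)

# With DescTools installed, DescTools::Winsorize(revenue_2011) with its
# default arguments should clip in a comparable way (assumption: the
# defaults are the 5% and 95% quantiles in your installed version).
```

Values inside the limits are left unchanged, while extreme values are pulled in to the nearest limit, so no observation is dropped. To treat each year separately on long-format data, the same function can be applied per group.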