Remove entire observations that are outliers in panel data

630 views Asked by At

I have a dataset that shows the revenue over 20 years of around 100.000 companies. The data has many other variables, but, below, I'm writing a reproducible version of a simplified sample of this dataset in a long format.

my_data <- data.frame(Company = c("A","A","A","B", "B", "B", "C", "C", "C"), years = c("2010", "2011", "2012","2010", "2011", "2012","2010", "2011", "2012"), revenue = c(10, 11, 12, 10, 12, 11, 11, 10, 956))

My data is much longer, but it follows the structure of this small sample, where the variable 'company' is my ID, and my longitudinal data comes from the revenues per year.

As you can see in this small sample, the revenue of the company C in the year 2012 was an outlier, and I would like to remove not only the observation of that year, but all observations related to company C. What is the best way to remove all observations of company C based on the value of one (or more) revenue observations?

In the real data, I use the quantile 99th to determine the outliers, with a code such as:

my_data$outliers <- ifelse(my_data$revenue >= quantile(my_data$revenue, .0.99, na.rm = TRUE), TRUE, FALSE)

That gives me a column that identifies the outliers in their respective years. I just don't know how to remove the other observations of the same company.

2

There are 2 answers

1
Ricardo Semião On

First, my_data$revenue >= quantile(my_data$revenue, .0.99, na.rm = TRUE) already returns TRUE of FALSE, so no need for the ifelse.

You can use this result to remove rows from de data set by:

index = my_data$revenue >= quantile(my_data$revenue, .0.99, na.rm = TRUE)
company = my_data[,1]==my_data[which(index),1]
my_data[-which(company),]
0
Ronak Shah On

You can get the corresponding Company where revenue is above 99th quantile and remove all the rows of those company.

In base R, you can do this as :

subset(my_data, !Company %in% unique(Company[revenue > 
                 quantile(revenue, 0.99, na.rm = TRUE)]))

#  Company years revenue
#1       A  2010      10
#2       A  2011      11
#3       A  2012      12
#4       B  2010      10
#5       B  2011      12
#6       B  2012      11

If you prefer dplyr you can write the same code in dplyr as well :

library(dplyr)

my_data %>%
  filter(!Company %in% unique(Company[revenue > 
                       quantile(revenue, 0.99, na.rm = TRUE)]))