I have a dataset that shows the revenue over 20 years of around 100.000 companies. The data has many other variables, but, below, I'm writing a reproducible version of a simplified sample of this dataset in a long format.
my_data <- data.frame(Company = c("A","A","A","B", "B", "B", "C", "C", "C"), years = c("2010", "2011", "2012","2010", "2011", "2012","2010", "2011", "2012"), revenue = c(10, 11, 12, 10, 12, 11, 11, 10, 956))
My data is much longer, but it follows the structure of this small sample, where the variable 'company' is my ID, and my longitudinal data comes from the revenues per year.
As you can see in this small sample, the revenue of the company C in the year 2012 was an outlier, and I would like to remove not only the observation of that year, but all observations related to company C. What is the best way to remove all observations of company C based on the value of one (or more) revenue observations?
In the real data, I use the quantile 99th to determine the outliers, with a code such as:
my_data$outliers <- ifelse(my_data$revenue >= quantile(my_data$revenue, .0.99, na.rm = TRUE), TRUE, FALSE)
That gives me a column that identifies the outliers in their respective years. I just don't know how to remove the other observations of the same company.
First,
my_data$revenue >= quantile(my_data$revenue, .0.99, na.rm = TRUE)
already returnsTRUE
ofFALSE
, so no need for theifelse
.You can use this result to remove rows from de data set by: