I have a large dataset (25,00 records) of only numbers and I'd like to know its 80% threshold. Whenever I try to run `boxplot(data)` on it, the result is not clear because I have so many values.
I tried to get around this with `boxplot(data, outline = FALSE)`, which produced a second plot without the outlier points.
So basically I was wondering about two things:
1) The upper line in the second picture represents 75%, correct? If so, how can I change the 75% to 80%?
2) Is my way of hiding the outliers correct? Or is there a simpler way of showing just the 80% threshold?
Sorry for not providing a reproducible example; I will edit the question to add one if necessary.
Not that I would suggest you do this, but you can change the statistical summary used to draw the boxplot, and replace any of the stats with your own statistics.
For example, to do as you asked and draw the upper bound of the box at the 0.8 quantile of the distribution of `x` (instead of at the upper hinge), we would first compute the boxplot summary without plotting it and store the result in `bps`.
`bps` now contains the summary that `boxplot()` would normally draw, and it is the `stats` component that we want to change.
Now we use the `bxp()` function to do the plotting. `boxplot()` uses this function internally, so we are not doing anything different here. This would give us a plot whose box extends up to the 0.8 quantile.
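A minimal sketch of these steps, assuming simulated data in place of the real dataset (the sample size and seed are arbitrary). The red reference line is drawn here with `abline()`, which is one way to mark the original upper hinge and is an assumption about how such a line could be added:

```r
set.seed(1)
x <- rnorm(1000)   # assumption: simulated stand-in for the real data

# Compute the boxplot summary without drawing anything
bps <- boxplot(x, plot = FALSE)

# Remember the original upper hinge so it can be marked for comparison
upper_hinge <- bps$stats[4, 1]

# Rows of bps$stats: lower whisker, lower hinge, median,
# upper hinge, upper whisker. Replace row 4 with the 0.8 quantile.
bps$stats[4, 1] <- quantile(x, 0.8)

# Draw the plot from the modified summary
bxp(bps)

# Mark where boxplot() would have put the top of the box
abline(h = upper_hinge, col = "red")
```

Only the fourth row of `stats` is touched here, so the whiskers and outliers are still the ones computed from the original hinges.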
The red line was added separately and marks the upper hinge, i.e. where `boxplot()` would have drawn the upper edge of the box.
The reason I say I wouldn't suggest you do this is that it changes the interpretation of the plot from what people would expect. If you do modify the boxplot, you'll need to explain your new conventions to the people to whom the plot is shown.
If we were being more careful about this, we'd also push the whiskers farther out now because the size of the box has increased. Replacing the lower and upper whiskers in `bps` and replotting gives a plot with longer whiskers.
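One possible way to recompute the whiskers, sketched under the assumption that the usual boxplot convention (the most extreme observations within 1.5 box lengths of the box edges) is simply applied to the enlarged box. The simulated data and the variable names `box_lo`, `box_hi` and `box_len` are illustrative:

```r
set.seed(1)
x <- rnorm(1000)                      # assumption: simulated stand-in data

bps <- boxplot(x, plot = FALSE)
bps$stats[4, 1] <- quantile(x, 0.8)   # upper box edge at the 0.8 quantile

# Box edges and box length after the modification
box_lo  <- bps$stats[2, 1]
box_hi  <- bps$stats[4, 1]
box_len <- box_hi - box_lo

# Whiskers: most extreme observations within 1.5 box lengths of the box
bps$stats[1, 1] <- min(x[x >= box_lo - 1.5 * box_len])
bps$stats[5, 1] <- max(x[x <= box_hi + 1.5 * box_len])

# Replot from the updated summary
bxp(bps)
```

Because the box is longer, both fences move outward, so each whisker ends at least as far from the box as before.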
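Although now you have the problem of the outliers: the stored outliers were identified using the old whiskers, so some of them may now fall inside the new ones. So we should fix that too, which finally produces the finished plot.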
Putting all of this together we have:
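The following is a sketch rather than a definitive implementation. It wraps the three adjustments (box edge, whiskers, outliers) in a hypothetical helper, `quantile_boxplot()`, which is not part of base R. It assumes a single numeric vector with no missing values and uses simulated data:

```r
# Hypothetical helper: boxplot whose upper box edge sits at the
# `upper` quantile instead of the upper hinge.
quantile_boxplot <- function(x, upper = 0.8, coef = 1.5, ...) {
  bps <- boxplot(x, plot = FALSE)

  # 1. Move the upper edge of the box
  bps$stats[4, 1] <- quantile(x, upper)

  # 2. Recompute the whiskers from the enlarged box
  box_lo  <- bps$stats[2, 1]
  box_hi  <- bps$stats[4, 1]
  box_len <- box_hi - box_lo
  bps$stats[1, 1] <- min(x[x >= box_lo - coef * box_len])
  bps$stats[5, 1] <- max(x[x <= box_hi + coef * box_len])

  # 3. Recompute the outliers: anything beyond the new whiskers
  is_out    <- x < bps$stats[1, 1] | x > bps$stats[5, 1]
  bps$out   <- x[is_out]
  bps$group <- rep(1, sum(is_out))

  # Draw the plot and return the modified summary invisibly
  bxp(bps, ...)
  invisible(bps)
}

set.seed(1)
x <- rnorm(1000)   # assumption: simulated stand-in data
res <- quantile_boxplot(x)
```

Returning the modified summary makes it easy to inspect `res$stats` and `res$out` when documenting the non-standard conventions for readers of the plot.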