Showing an 80% line on a boxplot in R


I have a large dataset (25,00 records) of only numbers and I'd like to know its 80% threshold. Whenever I try to run boxplot(data) on it, the result is not clear because I have so many values:

[image: Unclear]

I tried to go around it and came up with boxplot(data, outline = FALSE) and the result is below:

[image: Clear]

So basically I was wondering about two things:

1) The upper line of the box in the second picture represents 75%, correct? If so, how can I change the 75% to 80%?

2) Is my way of hiding the outliers correct? Or is there a simpler way of showing just the 80% threshold?

Sorry for not providing a reproducible example; I will edit the question if necessary.

There are 3 answers

Gavin Simpson (best answer)

Not that I would suggest you do this, but you can change the statistical summary used to draw the boxplot, and replace any of the stats with your own statistics.

For example, to do as you asked and draw the upper bound of the box at the 0.8 quantile of the distribution of x (instead of at the upper hinge), then we would do:

set.seed(10)
x <- rnorm(1000)
bps <- boxplot(x, plot = FALSE)

bps now contains:

> bps
$stats
             [,1]
[1,] -2.643849768
[2,] -0.677466940
[3,] -0.003001333
[4,]  0.728515646
[5,]  2.700636623

$n
[1] 1000

$conf
            [,1]
[1,] -0.07324983
[2,]  0.06724716

$out
[1] -3.001431  3.541140  3.012906 -3.012164

$group
[1] 1 1 1 1

$names
[1] "1"

It is the stats component that we want to change:

## Replace the 4th statistic (upper box)
bps$stats[4, 1] <- quantile(x, probs = 0.8)

Now we use the bxp() function to do the plotting - boxplot() uses this function internally, so we are not doing anything different here

bxp(bps)

This would give us:

[image: boxplot with the upper edge of the box at the 0.8 quantile and a dashed red line at the upper hinge]

The red line was added using

abline(h = fivenum(x)[4], col = "red", lty = "dashed", lwd = 2)

and is the upper hinge, i.e. where boxplot() would have drawn the upper edge of the box.
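To see how far apart these values actually are, it can help to print them side by side. The following is only a sketch on the same simulated x; the variable names upper_hinge, q75 and q80 are illustrative and not part of the original answer.

```r
set.seed(10)
x <- rnorm(1000)

# Upper hinge: the value boxplot() uses for the top edge of the box
upper_hinge <- fivenum(x)[4]

# 75% and 80% quantiles using quantile()'s default method
q75 <- quantile(x, probs = 0.75, names = FALSE)
q80 <- quantile(x, probs = 0.80, names = FALSE)

print(c(upper_hinge = upper_hinge, q75 = q75, q80 = q80))
```

For a sample of this size the hinge and the 75% quantile should be nearly identical, while the 80% quantile sits visibly higher.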

The reason I say I wouldn't suggest you do this is that it changes the interpretation of the plot from what people would expect. If you do modify the boxplot, you'll need to explain your new conventions to the people to whom the plot is shown.

If we were being more careful about this, we'd also push the whiskers farther out, now that the box is larger. This replaces the lower and upper whiskers:

bps$stats[c(1,5), 1] <- bps$stats[c(2,4), 1] + 
    (c(-1,1) * (1.5 * diff(bps$stats[c(2,4), 1])))

and replotting:

bxp(bps)
abline(h = fivenum(x)[4], col = "red", lty = "dashed", lwd = 2)

gives

[image: the same boxplot with the whiskers pushed farther out]
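Note that the whisker code above puts each whisker exactly on the fence, 1.5 box lengths beyond the box edge, whereas boxplot() by default ends each whisker at the most extreme observation that still lies inside the fence. A minimal sketch of that snapping step, assuming the same simulated x (the names box, fences, inside, whiskers and outliers are illustrative only):

```r
set.seed(10)
x <- rnorm(1000)

# Box edges of the modified plot: lower hinge and the 0.8 quantile
box <- c(fivenum(x)[2], quantile(x, probs = 0.8, names = FALSE))

# Fences 1.5 box lengths beyond each edge
fences <- box + c(-1, 1) * 1.5 * diff(box)

# Snap each whisker to the most extreme observation inside the fences
inside <- x >= fences[1] & x <= fences[2]
whiskers <- range(x[inside])

# Everything outside the fences is drawn as an outlier
outliers <- x[!inside]

# Write the values into the summary that bxp() plots
bps <- boxplot(x, plot = FALSE)
bps$stats[c(1, 4, 5), 1] <- c(whiskers[1], box[2], whiskers[2])
bps$out <- outliers
bps$group <- rep(1, length(outliers))
```

Calling bxp(bps) afterwards draws the result. Because the outliers are recomputed from the same fences, no separate clean-up of bps$out is needed with this variant.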

Now, though, some of the points recorded as outliers fall inside the new whiskers, so we should fix that too:

ind <- bps$out > bps$stats[1,1] & bps$out < bps$stats[5,1]
bps$out <- bps$out[!ind]
bps$group <- bps$group[!ind]

which finally produces

[image: the final boxplot, with the outliers that fall inside the new whiskers removed]

Putting all of this together we have:

set.seed(10)
x <- rnorm(1000)
bps <- boxplot(x, plot = FALSE)
bps$stats[4, 1] <- quantile(x, probs = 0.8)
bps$stats[c(1,5), 1] <- bps$stats[c(2,4), 1] + 
    (c(-1,1) * (1.5 * diff(bps$stats[c(2,4), 1])))
ind <- bps$out > bps$stats[1,1] & bps$out < bps$stats[5,1]
bps$out <- bps$out[!ind]
bps$group <- bps$group[!ind]
bxp(bps)

Nick Kennedy

You could achieve this by generating the stats yourself and using bxp to plot it.

x <- rnorm(1000)
myBox <- list(stats = matrix(quantile(x, c(0, 0.2, 0.5, 0.8, 1)), ncol = 1), n = length(x))
bxp(myBox)

This will have the hinges at the 20th and 80th centiles and the whiskers at the min and max. For your data, you may want alternative whiskers which you could create either with alternative quantiles or as some distance relative to the range between the 20th and 80th centiles. For the latter, you could do:

myBox2 <- list(stats = matrix(quantile(x, c(0, 0.2, 0.5, 0.8, 1)), ncol = 1), n = length(x))
## Whiskers one box length beyond the 20th and 80th centiles
myBox2$stats[c(1, 5)] <- myBox2$stats[c(2, 4)] + seq(-1, 1, 2) * diff(myBox2$stats[c(2, 4)])
bxp(myBox2)

Just to add, I entirely agree with Gavin that you would need to explain to anyone looking at this modified boxplot that it differed from the usual interpretation.
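For the other option mentioned above, whiskers at alternative quantiles, the construction can be wrapped in a small function. This is a sketch only: quantile_box is a hypothetical helper, not a base R function, and the 5%/95% whisker positions are an arbitrary choice for illustration.

```r
# Hypothetical helper (not part of base R): build a list that bxp() can plot
# from five arbitrary, increasing probabilities.
quantile_box <- function(x, probs = c(0.05, 0.2, 0.5, 0.8, 0.95)) {
  stopifnot(length(probs) == 5, !is.unsorted(probs))
  x <- x[!is.na(x)]
  stats <- matrix(quantile(x, probs = probs, names = FALSE), ncol = 1)
  # Keep points beyond the whiskers so bxp() can draw them as outliers
  out <- x[x < stats[1, 1] | x > stats[5, 1]]
  list(stats = stats, n = length(x), out = out,
       group = rep(1, length(out)), names = "")
}

set.seed(10)
x <- rnorm(1000)
b <- quantile_box(x)
bxp(b)
```

With these defaults roughly 10% of the observations lie beyond the whiskers and are drawn as individual points, which may be too many for a large dataset; passing outline = FALSE to bxp() hides them.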

mts

Answer 1: the upper edge of the box shows the upper hinge, which is usually close to, but need not be identical to, the 75% quantile (see comments below). If you are not sure, check with fivenum(data[,1]) and quantile(data[,1]). Nick K and Gavin Simpson have shown in their answers how to change that, and I agree with them that doing so is risky unless you are familiar with how boxplots are read. The most standard thing to do would be to change the range argument of boxplot, but you'll need to state that you did so.
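As a rough sketch of what the range argument does: it controls how far the whiskers may reach, in multiples of the box length, and leaves the box edges where they are. The data here are simulated and the variable names are illustrative.

```r
set.seed(10)
x <- rnorm(1000)

default_stats <- boxplot(x, plot = FALSE)             # range = 1.5 (default)
wide_stats    <- boxplot(x, range = 3, plot = FALSE)  # whiskers up to 3 box lengths
full_stats    <- boxplot(x, range = 0, plot = FALSE)  # whiskers reach the data extremes

# Number of points flagged as outliers under each setting
print(c(default = length(default_stats$out),
        wide    = length(wide_stats$out),
        full    = length(full_stats$out)))
```

A larger range reduces the number of points drawn as outliers, but it does not turn the top of the box into an 80% line.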

Answer 2: To do justice to your data you should show both plots: the first documents the outliers, the second your quantiles. To better represent all the data in one plot, you could try a log scale:

boxplot(data, log="y")

This will give you an error if the data contain zeros (as yours likely do), so you could modify your data first:

data[which(data<= 1e-3)] = 1e-3
boxplot(data, log="y")

You will have to check whether that gives a readable plot for your data.
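Since the original data were not posted, here is a self-contained sketch on simulated values. The log-normal sample with added zeros is only an assumed stand-in for the real data, and plot_data, floor_value and q80 are illustrative names. It applies the same floor as above to a copy of the data and then marks the 80% threshold on an otherwise standard boxplot with abline().

```r
set.seed(10)
# Assumed stand-in for the real data: skewed, non-negative, with some zeros
data <- c(rlnorm(2000, meanlog = 0, sdlog = 2), rep(0, 50))

# Apply the floor to a copy so the original values stay untouched
floor_value <- 1e-3
plot_data <- pmax(data, floor_value)

boxplot(plot_data, log = "y")

# Mark the 80% threshold of the original data without altering the box itself
q80 <- quantile(data, probs = 0.8, names = FALSE)
abline(h = q80, col = "red", lty = "dashed", lwd = 2)
```

Overlaying the threshold as a separate line keeps the usual meaning of the box and whiskers, so the plot needs less explanation than one with a redefined box.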