Pareto graph in ggplot2


I managed to create a Pareto chart, but I would like to improve a few things and lack the skills to do it. Could someone have a quick look at the graph and let me know:

  1. On the right y-axis, which shows the cumulative frequencies (%), can I put a percentage symbol after the numbers? That way I could remove the axis title, which would be great.

  2. If number 1 is not possible, how can I make the right y-axis title bigger? I could not find a place where size = 12 is accepted, and I am not sure how that would work. I was also considering rotating the title, but again I am not sure whether this is doable.

  3. Is it possible to rotate the A,B,C,D... labels so that they are not vertical but horizontal?

  4. I was wondering if adding relative frequencies above the bars is an option, as well as percentages above the dots on the red curve, which represent the cumulative frequencies?

Minimal example

set.seed(42)  ## for sake of reproducibility
c <- data.frame(value=factor(paste("value", 1:n)),counts=sample(18:130, n, replace=TRUE))

Cumulative frequencies for the Pareto chart

# It's maybe not the most elegant way of doing it but it works
# If someone can offer an alternative, that would be nice

df <- data.frame(c,stringsAsFactors = FALSE)

df <- df[order(df$counts,decreasing=TRUE), ]

df$value <- factor(df$value, levels=df$value)

df$cumulative <- cumsum(df$counts)

df$cumulative <- 100 * df$cumulative/tail(df$cumulative, n=1)

scaleRight <- tail(df$cumulative, n=1)/head(df$counts, n=1)

Pareto chart in ggplot

ggplot(df, aes(x=value)) +  theme_bw()+
  geom_bar(aes(y=counts, fill=value), stat="identity",show.legend = FALSE) +
  geom_path(aes(y=cumulative/scaleRight, group=1),colour="red", size=0.9) +
  geom_point(aes(y=cumulative/scaleRight, group=1),colour="red") +
  scale_y_continuous(sec.axis = sec_axis(~.*scaleRight, name = "Cumulative (%)"), n.breaks = 9) +
  theme(axis.text.x = element_text(angle=90, vjust=0.6)) +
  theme(
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.background =element_blank(),panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.title.x=element_blank(),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  scale_color_grey(start=0, end=.6)+scale_fill_grey()+ ylab("Counts")

Output

Pareto Chart

MrGumble (best answer)

I love your question, you have put a great deal of effort into asking a good question with a reproducible example and working code (except n wasn't defined, but usually I can count to 7).

First off, I have taken the liberty of refactoring your data manipulation code using tidyverse's dplyr, which makes it much more succinct to read. I also avoided multiplying your cumulative percentage by 100, and you will see why. Also, I didn't get the same values as you did.

library(dplyr)  ## provides %>%, arrange() and mutate() used below
set.seed(42)  ## for sake of reproducibility
n <- 6
c <- data.frame(value=factor(paste("value", 1:n)),counts=sample(18:130, n, replace=TRUE))
dput(c)
structure(list(value = structure(1:6, .Label = c("value 1", "value 2", 
"value 3", "value 4", "value 5", "value 6"), class = "factor"), 
    counts = c(66L, 118L, 82L, 42L, 91L, 117L)), class = "data.frame", row.names = c(NA, 
-6L))

df <- c %>%
  arrange(desc(counts)) %>%
  mutate(
    value = factor(value, levels=value),
    cumulative = cumsum(counts) / sum(counts)
  ) 

df
    value counts cumulative
1 value 2    118  0.2286822
2 value 6    117  0.4554264
3 value 5     91  0.6317829
4 value 3     82  0.7906977
5 value 1     66  0.9186047
6 value 4     42  1.0000000

I assume the A, B, C, D labels you are referring to are the x-axis labels. They are rotated by a quarter turn because of this command in your own code; it is the angle=90 that causes it:

theme(axis.text.x = element_text(angle=90, vjust=0.6))
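To get horizontal labels, drop that line or set the angle back to 0. A minimal sketch, using a made-up toy data frame rather than the df from the question; hjust = 0.5 is my assumption for centring the labels under the bars:

```r
library(ggplot2)

# Toy data standing in for the question's df (values are made up)
toy <- data.frame(value = factor(c("A", "B", "C")), counts = c(30, 20, 10))

p <- ggplot(toy, aes(x = value, y = counts)) +
  geom_col() +
  # angle = 0 keeps the x-axis labels horizontal; hjust = 0.5 centres them
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))

print(p)
```

Since angle = 0 is the default for axis text, simply removing the angle=90 line from the original plot should have the same effect.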

All in all, I propose the following solution:

library(ggplot2)

f <- max(df$counts) # or df$counts[1], as df is sorted in descending order

ggplot(df, aes(x=value)) +  theme_bw(base_size = 12)+
  geom_bar(aes(y=counts, fill=value), stat="identity",show.legend = FALSE) +
  geom_path(aes(y=cumulative*f, group=1),colour="red", size=0.9) +
  geom_point(aes(y=cumulative*f, group=1),colour="red") +
  scale_y_continuous("Counts", sec.axis = sec_axis(~./f, labels = scales::percent), n.breaks = 9) +
  scale_fill_grey() +
  theme(
    axis.text = element_text(size=12),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title.x=element_blank()
  )

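This is why I kept cumulative as a fraction: the red line is drawn at cumulative*f in count units, sec_axis(~./f) maps those positions back to fractions, and scales::percent turns the fractions into labels with a % sign. A small sketch of that round trip, using the counts from df above; the accuracy = 1 argument is my addition to force whole percentages:

```r
library(scales)

counts <- c(118, 117, 91, 82, 66, 42)        # sorted counts from df above
cumulative <- cumsum(counts) / sum(counts)   # fractions between 0 and 1
f <- max(counts)

line_y <- cumulative * f   # where the red line is drawn, in count units
back   <- line_y / f       # what sec_axis(~./f) maps those positions back to

# percent() multiplies by 100 and appends "%"
labels <- percent(back, accuracy = 1)
print(labels)
```

Had cumulative already been multiplied by 100, percent() would multiply it by 100 a second time.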

In response to questions:

Adding labels can be done with geom_text:

geom_text(aes(label=sprintf('%.0f%%', cumulative*100), y=cumulative*f), colour='red', nudge_y = 5) +
geom_text(aes(label=sprintf('%.0f%%', counts/sum(counts)*100), y=counts), nudge_y = 5) +

Note the use of nudge_y. This one can be tricky, because it works in the units of the primary y-axis: nudging by 5 units makes sense here, but if your counts were in the thousands, 5 would not be enough.
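One way around this, as a rough sketch: derive the nudge from the data instead of hard-coding it. The 3% factor and the counts below are made up for illustration.

```r
library(ggplot2)

# Made-up counts in the thousands
big <- data.frame(
  value  = factor(c("value 2", "value 6", "value 5"),
                  levels = c("value 2", "value 6", "value 5")),
  counts = c(1180, 1170, 910)
)

# Hypothetical rule of thumb: offset labels by 3% of the tallest bar
nudge <- 0.03 * max(big$counts)

p <- ggplot(big, aes(x = value, y = counts)) +
  geom_col() +
  geom_text(aes(label = counts), nudge_y = nudge)

print(p)
```

Alternatively, geom_text(..., vjust = -0.5) shifts the label relative to its own height rather than in data units, so it does not depend on the scale of the counts.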

Please note that the solutions given here only work as long as c (and df) contains the entire scope of values; i.e. if you have 8 or 10 or more faults but only want to show the 6 main faults, the calculations of cumulative sums and percentages will be wrong.
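If you do need to show only the top few, one possible approach is to compute the shares on the full data first and truncate afterwards. A sketch with made-up data for 8 categories; the names all_faults, share and top are hypothetical:

```r
library(dplyr)

# Made-up example: 8 categories, only the top 6 should be plotted
all_faults <- data.frame(
  value  = paste("value", 1:8),
  counts = c(66, 118, 82, 42, 91, 117, 10, 5)
)

top <- all_faults %>%
  arrange(desc(counts)) %>%
  mutate(
    value      = factor(value, levels = value),
    share      = counts / sum(counts),          # relative frequency over ALL rows
    cumulative = cumsum(counts) / sum(counts)   # cumulative share over ALL rows
  ) %>%
  head(6) %>%      # truncate only after the sums are computed
  droplevels()     # drop the factor levels of the rows that were cut

print(top)
```

With this ordering the red curve ends below 100% for the last bar shown, which reflects the categories that were left out.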