I have a ggplot bar chart with 700 bars and want to add a Pareto line. It works, but the y scale makes the bars too small to show up in the chart


I am new to R, and thanks to a lot of people I have my charts working.

I have three charts:

Random bars

Bars ordered by frequency

Pareto overlay. If you look closely, you can see the scaled, frequency-ordered chart at the bottom.

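library(dplyr)
library(ggplot2)

# Drop rows with the literal string "NA" (filter() also drops true NA values)
df <- filter(df_clean_distances, end_station_name != "NA")
# Count rides per end station
d <- df %>%
  select(end_station_name) %>%
  group_by(end_station_name) %>%
  summarize(freq = n())
head(d$freq)
d2 <- d[order(-d$freq), ]  # sort by descending frequency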

plot random

ggplot(d2, aes(x = end_station_name, y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_blank()) +
  ylim(c(0, 40000))

plot freq ordered

ggplot(d2, aes(x = reorder(end_station_name, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_blank()) +
  ylim(c(0, 40000)) +
  labs(title = "end station by freq", x = "Station Name")

Plot with Pareto overlay


ggplot(d2, aes(x = reorder(end_station_name, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_blank()) +
  ggQC::stat_pareto(point.color = "red", point.size = 0.5) +
  labs(title = "end station by freq", x = "Station Name")

dput(head) output

> dput(head(d, n=20))
  structure(list(end_station_name = c("2112 W Peterson Ave", "63rd St 
  "900 W Harrison St", "Aberdeen St & Jackson Blvd", "Aberdeen St & Monroe St",
  "Aberdeen St & Randolph St", "Ada St & 113th St", "Ada St & Washington Blvd",
  "Adler Planetarium", "Albany Ave & 26th St", "Albany Ave & Bloomingdale Ave",
  "Albany Ave & Montrose Ave", "Archer (Damen) Ave & 37th St",
  "Artesian Ave & Hubbard St", "Ashland Ave & 13th St", "Ashland Ave & 50th St",
  "Ashland Ave & 63rd St", "Ashland Ave & 66th St", "Ashland Ave & 69th St",
  "Ashland Ave & 73rd St"), freq = c(1032L, 2524L, 3836L, 8383L,
  6587L, 6136L, 18L, 6281L, 12050L, 397L, 2833L, 1875L, 710L, 1879L,
  2659L, 151L, 112L, 102L, 78L, 8L)), row.names = c(NA, -20L), class =
  c("tbl_df", "tbl", "data.frame"))

As you can see, the Pareto plot works for the right-hand scale, but the left-hand scale is out of whack by a lot. The data has about 3 million rows, and the y-axis scaling has shrunk the frequency bars to a very small curve along the bottom. It is there on the left, just hard to see.

How do I limit the left y axis to about 40,000 so that the frequency curve shows up correctly?


There is 1 answer

Rui Barradas

Here is a solution that uses sec_axis instead of the ggQC package.
The trick is to pre-compute max(freq) and use it as a scale factor to align the two axes. The data preparation code is inspired by this rstudio-pubs blog post.
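
To see why a single scale factor lines up both axes, here is a small arithmetic sketch. It uses only the first ten freq values from the dput() sample in the question. The variable names `scaled` and `back` are made up for illustration.

```r
# First ten frequencies from the question's dput() sample, sorted descending
freq <- sort(c(1032, 2524, 3836, 8383, 6587, 6136, 18, 6281, 12050, 397),
             decreasing = TRUE)

M <- max(freq)                        # height of the tallest bar (12050 here)
cum_freq <- cumsum(freq / sum(freq))  # cumulative share, rising to 1

scaled <- cum_freq * M  # values the line is drawn at, in left-axis units
back   <- scaled / M    # what a secondary axis defined as ~ . / M would show

max(scaled)             # equals M, so the line tops out level with the tallest bar
all.equal(back, cum_freq)
```

Because the cumulative share never exceeds 1, the scaled line never exceeds M, so the left axis stays in the range of the bar heights while the right axis reads 0% to 100%.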


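library(dplyr)
library(ggplot2)

# Tallest bar; used to rescale the cumulative share onto the left axis
M <- max(d$freq)

d %>%
  arrange(desc(freq)) %>%
  mutate(cum_freq = cumsum(freq/sum(freq))) %>%
  ggplot(aes(x = reorder(end_station_name, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  # Cumulative share, scaled up to the bar heights
  geom_line(mapping = aes(y = cum_freq*M, group = 1)) +
  geom_point(
    mapping = aes(y = cum_freq*M),
    color = "red",
    size = 0.5
  ) +
  # The secondary axis undoes the scaling and shows percentages
  scale_y_continuous(
    sec.axis = sec_axis(~ ./M,
                        labels = scales::percent,
                        name = "Cumulative percentage")) +
  labs(title = "end station by freq", x = "Station Name") +
  theme(axis.text.x = element_blank())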
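
For anyone without the full data set, the following is a self-contained sketch of the same approach. It assumes dplyr and ggplot2 are installed. The freq values come from the question's dput() sample, but the data frame name `d_sample` and the `station_*` labels are placeholders, not real station names.

```r
library(dplyr)
library(ggplot2)

# Frequencies from the question's sample; station labels are placeholders
d_sample <- tibble(
  end_station_name = paste0("station_", 1:10),
  freq = c(1032L, 2524L, 3836L, 8383L, 6587L, 6136L, 18L, 6281L, 12050L, 397L)
)

M <- max(d_sample$freq)  # scale factor shared by the line and the secondary axis

p <- d_sample %>%
  arrange(desc(freq)) %>%
  mutate(cum_freq = cumsum(freq / sum(freq))) %>%
  ggplot(aes(x = reorder(end_station_name, -freq), y = freq)) +
  geom_col() +  # same as geom_bar(stat = "identity")
  geom_line(aes(y = cum_freq * M, group = 1)) +
  geom_point(aes(y = cum_freq * M), color = "red", size = 0.5) +
  scale_y_continuous(
    sec.axis = sec_axis(~ . / M, labels = scales::percent,
                        name = "Cumulative percentage")
  ) +
  labs(title = "end station by freq", x = "Station Name") +
  theme(axis.text.x = element_blank())

print(p)
```

With this scaling no explicit ylim() is needed: the left axis tops out at the tallest bar, which in the full data should sit near the 40,000 the question asks for.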
