How to plot bar chart of monthly deviations from annual mean?

SO!

I am trying to create a plot of monthly deviations from annual means for temperature data using a bar chart. I have data across many years and I want to show the seasonal behavior in temperatures between months. The bars should represent the deviation from the annual average, which is recalculated for each year. Here is an example that is similar to what I want, only it is for a single year:

Alaska Temperatures

My data is sensitive so I cannot share it yet, but I made a reproducible example using the txhousing dataset (it comes with ggplot2). The salesdiff column is the deviation of monthly sales (averaged across all cities) from the annual average for each year. Now the problem is plotting it.

library(ggplot2)
df <- aggregate(sales~month+year,txhousing,mean)

df2 <- aggregate(sales~year,txhousing,mean)

df2$sales2 <- df2$sales #RENAME sales
df2 <- df2[,-2] #REMOVE sales

df3<-merge(df,df2) #MERGE dataframes

df3$salesdiff <- df3$sales - df3$sales2 #FIND deviation between monthly and annual means

#plot deviations
ggplot(df3,aes(x=month,y=salesdiff)) +
         geom_col()

My ggplot is not looking good at the moment.

Somehow it is stacking the columns for each month with all of the data across the years. Ideally the date would be along the x-axis spanning many years (I think the dataset is from 2000-2015...), with different colors depending on whether salesdiff is above or below zero. You are all awesome, and I would welcome ANY advice!!!!

There are 2 answers

Marcus Campbell On BEST ANSWER

Probably the main issue here is that geom_col() will not take on different aesthetic properties unless you explicitly tell it to. One way to get what you want is to map the fill aesthetic to a variable that records whether each deviation is positive or negative, so that a single geom_col() layer draws the bars in two colours. Also, you're going to need to create date information which can be easily passed to ggplot(); I use the lubridate package for this task.

Note that we combine the "month" and "year" columns here, and then use ymd() to obtain date values. I chose not to convert the double-valued "date" column in txhousing using something like date_decimal(), because sometimes it can confuse February and January months (e.g. Feb 1 gets "rounded down" to Jan 31).
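To make the date step concrete, here is a small base R sketch of the same idea without lubridate: paste year and month together, pin the day to 1, and let as.Date() parse it. The second half is only a back-of-the-envelope illustration of why a decimal year can land in the wrong month; it is not meant to reproduce how date_decimal() works internally. The year and month vectors are made up for the example.

```r
# Toy year/month values (made up for illustration)
year  <- c(2011, 2011, 2012)
month <- c(1, 2, 12)

# Build a real Date by pinning every observation to the first of the month
date <- as.Date(paste(year, month, 1, sep = "-"))
print(date)

# txhousing stores its date as year + (month - 1) / 12, so February 2000
# is 2000 + 1/12. Converting that fraction back into whole days of a
# 366-day year gives 30 days after Jan 1, i.e. Jan 31 rather than Feb 1.
frac_days   <- floor((1 / 12) * 366)
approx_date <- as.Date("2000-01-01") + frac_days
print(approx_date)
```

Working from the integer year and month columns sidesteps that rounding entirely, which is why I prefer it here.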

I decided to plot a subset of the txhousing dataset, which is a lot more convenient to display for teaching purposes.

Code:

library("tidyverse") # attaches dplyr and ggplot2
library("lubridate") # ymd(); load explicitly, older tidyverse versions do not attach it

# subset txhousing to just years >= 2011, and calculate nested means and dates
housing_df <- filter(txhousing, year >= 2011) %>%
  group_by(year, month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE)) %>%
  # summarise() drops the "month" grouping, so the data are still grouped by
  # year here and mean(monthly_mean) below is each year's own annual mean
  mutate(yearmon = paste(year, month, sep = "-"),
         date = ymd(yearmon, truncated = 1), # create date column
         salesdiff = monthly_mean - mean(monthly_mean), # monthly deviation
         higherlower = case_when(salesdiff >= 0 ~ "higher", # for fill aes later
                                 salesdiff < 0 ~ "lower"))

ggplot(data = housing_df, aes(x = date, y = salesdiff, fill = as.factor(higherlower))) +
  geom_col() +
  scale_x_date(date_breaks = "6 months",
               date_labels = "%b-%Y") +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw()+
  theme(legend.position = "none") # remove legend

You can see the periodic behaviour here nicely; an increase in sales appears to occur every spring, with sales decreasing during the fall and winter months. Do keep in mind that you might want to reverse the colours I assigned if you want to use this code for temperature data! This was a fun one - good luck, and happy plotting!
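Since the real data are temperatures, here is a rough sketch of the same plot with the colours swapped. The temps data frame and its temp, tempdiff and sign columns are entirely made up (a simple cosine seasonal cycle), so treat it as a template under those assumptions rather than anything tuned to your data. It uses ave() from base R to get each year's own annual mean.

```r
library(ggplot2)

# Made-up monthly temperatures for two years (simple seasonal cycle)
temps <- expand.grid(month = 1:12, year = 2001:2002)
temps$temp <- 10 - 12 * cos(2 * pi * (temps$month - 1) / 12) +
  ifelse(temps$year == 2002, 1, 0)

# Deviation of each month from that year's own annual mean
temps$tempdiff <- temps$temp - ave(temps$temp, temps$year)

# Date for the x-axis, and a label to drive the fill colour
temps$date <- as.Date(paste(temps$year, temps$month, 1, sep = "-"))
temps$sign <- ifelse(temps$tempdiff >= 0, "warmer", "colder")

p <- ggplot(temps, aes(x = date, y = tempdiff, fill = sign)) +
  geom_col() +
  scale_x_date(date_breaks = "6 months", date_labels = "%b-%Y") +
  scale_fill_manual(values = c("warmer" = "red", "colder" = "blue")) + # reversed for temperature
  theme_bw() +
  theme(legend.position = "none")
print(p)
```

Because the deviations are taken from each year's own mean, they sum to zero within every year, which is a quick sanity check to run on real data too.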

RLave On

Something like this should work?

Basically you need to create a binary variable, called factordiff below, that lets you change the color (fill) depending on whether salesdiff is positive or negative.

Plus you needed a date variable for month and year combined.
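One thing to watch when the date is kept as text: ggplot2 lays out a character x-axis in sorted order, and text sorting puts "2001-10" before "2001-2". A quick base R check, with made-up month values, shows the problem and how zero-padding the month avoids it:

```r
months <- c(1L, 2L, 10L)

plain  <- paste0(2001, "-", months)           # "2001-1"  "2001-2"  "2001-10"
padded <- sprintf("%d-%02d", 2001L, months)   # "2001-01" "2001-02" "2001-10"

# method = "radix" sorts in the C locale, so the result is the same on any machine
print(sort(plain, method = "radix"))   # October jumps ahead of February
print(sort(padded, method = "radix"))  # calendar order is preserved
```

The same applies to comparisons such as date >= "2014-01": they only behave like date comparisons when every label has the same width.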

library(ggplot2)
library(dplyr)

df3$factordiff <- ifelse(df3$salesdiff>0, 1, 0) # factor variable for colors

df3 <- df3 %>%
  mutate(date = sprintf("%d-%02d", year, month)) # zero-padded label like "2001-01", so text order matches calendar order

#plot deviations
ggplot(df3,aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
  geom_col()

Of course this results in a hard-to-read plot because you have lots of dates. You can subset it and show only a restricted time span:

df3 %>%
  filter(date >= "2014-01") %>% # keep data from January 2014 onwards
  ggplot(aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # adds label rotation
