Using dodge position in ggplot changing column values


I'm trying to create a bar chart that shows the number of Impressions for each Gender and age group.

Data:

> head(ny1)
  Age Gender Impressions Clicks Signed_In age_group   hasimp        ctr  scode
1  36      0           3      0         1   (29,39] (0, Inf] 0.00000000   Imps
2  73      1           3      0         1 (69, Inf] (0, Inf] 0.00000000   Imps
3  30      0           3      0         1   (29,39] (0, Inf] 0.00000000   Imps
4  49      1           3      0         1   (39,49] (0, Inf] 0.00000000   Imps
5  47      1          11      0         1   (39,49] (0, Inf] 0.00000000   Imps
6  47      0          11      1         1   (39,49] (0, Inf] 0.09090909 Clicks


> str(ny1)
'data.frame':   458441 obs. of  9 variables:
 $ Age        : int  36 73 30 49 47 47 0 46 16 52 ...
 $ Gender     : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
 $ Impressions: int  3 3 3 3 11 11 7 5 3 4 ...
 $ Clicks     : int  0 0 0 0 0 1 1 0 0 0 ...
 $ Signed_In  : int  1 1 1 1 1 1 0 1 1 1 ...
 $ age_group  : Factor w/ 7 levels "(-Inf,19]","(19,29]",..: 3 7 3 4 4 4 1 4 1 5 ...
 $ hasimp     : Factor w/ 2 levels "(-Inf,0]","(0, Inf]": 2 2 2 2 2 2 2 2 2 2 ...
 $ ctr        : num  0 0 0 0 0 ...
 $ scode      : Factor w/ 3 levels "Clicks","Imps",..: 2 2 2 2 2 1 1 2 2 2 ...

This appears to work for a stacked bar chart:

ggplot(data=ny1, aes(x=age_group, y=Impressions)) +
  geom_bar(stat="identity", aes(fill = Gender))

[Image: correct impressions]

But when I simply add position = "dodge", the values shown on the y-axis change:

ggplot(data=ny1, aes(x=age_group, y=Impressions)) + 
  geom_bar(stat="identity", aes(fill = Gender), position = "dodge")

Why is the second column measuring a different number of impressions?

[Image: incorrect impressions]

Answer by Z.Lin (best answer):

Your first plot is a stacked bar chart, where each observation (i.e. one row of the dataset) is represented as one thin slice of the stack. If you check the help file ?geom_bar, the default argument is position = "stack".

When you change the position argument to position = "dodge", the bars are dodged by Gender only, so all observations in the same age group / Gender combination are drawn on top of one another, each starting from zero. The visible height of each bar is therefore the maximum Impressions value for that combination, not the sum. You can think of it as every observation in the same age group / Gender combination forming a long queue, such that from the front, you only see the tallest one.
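To check this reasoning numerically, here is a minimal sketch in base R. The `toy` data frame and its values are made up for illustration (they are not the asker's data), and the sketch assumes, as described above, that the overlapping dodged bars all start at zero so that only the tallest one is visible.

```r
# Hypothetical toy data: two observations share the "20-29" / Gender 0 combination
toy <- data.frame(
  age_group   = c("20-29", "20-29", "20-29", "30-39", "30-39"),
  Gender      = factor(c(0, 0, 1, 0, 1)),
  Impressions = c(3, 11, 5, 7, 4)
)

# Height of each Gender segment in the stacked chart: the slices add up
stacked_height <- aggregate(Impressions ~ age_group + Gender, data = toy, FUN = sum)

# Visible height in the dodged chart: bars overlap, so only the tallest shows
dodged_height <- aggregate(Impressions ~ age_group + Gender, data = toy, FUN = max)

# Put the two next to each other; they differ wherever a combination has > 1 row
comparison <- merge(stacked_height, dodged_height,
                    by = c("age_group", "Gender"),
                    suffixes = c("_sum", "_max"))
print(comparison)
```

Only the "20-29" / Gender 0 combination has more than one row here, so it is the only one where the sum and the maximum disagree. In the real data every combination has many rows, which is why the whole dodged chart looks wrong.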

In order to plot the stack of values dodged by gender, you can calculate the summarised values first:

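library(ggplot2)
library(dplyr)

# ny is the sample data defined below.
# Sum Impressions for each age group / Gender combination first,
# so that each bar is drawn from exactly one row.
ny_summary <- ny %>%
  group_by(age_group, Gender) %>%
  summarise(Impressions_total = sum(Impressions), .groups = "drop")

p <- ggplot(ny_summary,
            aes(x = age_group, y = Impressions_total, fill = Gender))

p1 <- p + geom_bar(stat = "identity")                      # stacked (default)
p2 <- p + geom_bar(stat = "identity", position = "dodge")  # side by side

gridExtra::grid.arrange(p1, p2, nrow = 1)
# each Gender segment in the stacked chart now has the same height as its dodged bar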

[Image: side by side comparison]

Sample data used for illustration:

set.seed(123)
ny <- data.frame(
  age_group = sample(c("00-19", "20-29", "30-39"), replace = TRUE, 20),
  Impressions = sample(5:20, replace = TRUE, 20),
  Gender = factor(sample(0:1, replace = TRUE, 20))
)
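If you would rather not pre-summarise, one alternative (not part of the approach above, so treat it as a sketch) is to let `geom_bar()` do the summing through its `weight` aesthetic: with the default `stat = "count"`, the bar height is the sum of the weights in each group rather than the number of rows. The `toy` data frame is hypothetical and only there to keep the example self-contained.

```r
library(ggplot2)

# Hypothetical toy data, not the asker's data set
toy <- data.frame(
  age_group   = c("20-29", "20-29", "20-29", "30-39", "30-39"),
  Gender      = factor(c(0, 0, 1, 0, 1)),
  Impressions = c(3, 11, 5, 7, 4)
)

# No y aesthetic here: stat_count() sums the weights for each
# age_group / Gender combination, so each combination yields one bar
p_weight <- ggplot(toy, aes(x = age_group, fill = Gender)) +
  geom_bar(aes(weight = Impressions), position = "dodge") +
  ylab("Impressions")

print(p_weight)
```

Because the statistic collapses each combination to a single bar before dodging, no bars are hidden behind one another.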

Side note: geom_col() is equivalent to geom_bar(stat = "identity"), so you can use that instead as well.
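As a short sketch of that equivalence, the dodged chart can be written with `geom_col()` on pre-summarised data. The `toy` data frame below is hypothetical, and `aggregate()` from base R stands in for the dplyr summary so that the example needs only ggplot2.

```r
library(ggplot2)

# Hypothetical toy data, not the asker's data set
toy <- data.frame(
  age_group   = c("20-29", "20-29", "20-29", "30-39", "30-39"),
  Gender      = factor(c(0, 0, 1, 0, 1)),
  Impressions = c(3, 11, 5, 7, 4)
)

# One row per age_group / Gender combination, holding the summed Impressions
toy_summary <- aggregate(Impressions ~ age_group + Gender, data = toy, FUN = sum)

# geom_col() uses the y values as given, like geom_bar(stat = "identity")
p_col <- ggplot(toy_summary, aes(x = age_group, y = Impressions, fill = Gender)) +
  geom_col(position = "dodge")

print(p_col)
```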