Trying to create a bar chart that shows the number of Impressions for each Gender and age group.
Data:
> head(ny1)
Age Gender Impressions Clicks Signed_In age_group hasimp ctr scode
1 36 0 3 0 1 (29,39] (0, Inf] 0.00000000 Imps
2 73 1 3 0 1 (69, Inf] (0, Inf] 0.00000000 Imps
3 30 0 3 0 1 (29,39] (0, Inf] 0.00000000 Imps
4 49 1 3 0 1 (39,49] (0, Inf] 0.00000000 Imps
5 47 1 11 0 1 (39,49] (0, Inf] 0.00000000 Imps
6 47 0 11 1 1 (39,49] (0, Inf] 0.09090909 Clicks
> str(ny1)
'data.frame': 458441 obs. of 9 variables:
$ Age : int 36 73 30 49 47 47 0 46 16 52 ...
$ Gender : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
$ Impressions: int 3 3 3 3 11 11 7 5 3 4 ...
$ Clicks : int 0 0 0 0 0 1 1 0 0 0 ...
$ Signed_In : int 1 1 1 1 1 1 0 1 1 1 ...
$ age_group : Factor w/ 7 levels "(-Inf,19]","(19,29]",..: 3 7 3 4 4 4 1 4 1 5 ...
$ hasimp : Factor w/ 2 levels "(-Inf,0]","(0, Inf]": 2 2 2 2 2 2 2 2 2 2 ...
$ ctr : num 0 0 0 0 0 ...
$ scode : Factor w/ 3 levels "Clicks","Imps",..: 2 2 2 2 2 1 1 2 2 2 ...
Now this appears to work for a stacked barchart.
ggplot(data=ny1, aes(x=age_group, y=Impressions)) +
geom_bar(stat="identity", aes(fill = Gender))
But when I simply add position = "dodge", the values on the y-axis change:
ggplot(data=ny1, aes(x=age_group, y=Impressions)) +
geom_bar(stat="identity", aes(fill = Gender), position = "dodge")
Why is the second column measuring a different number of impressions?
Your first plot is a stacked bar chart, where each observation (i.e. one row of the dataset) is represented as one thin slice of the stack. If you check the help file ?geom_bar, the default argument is position = "stack".
When you change the position argument to position = "dodge", the observations are no longer stacked. They are dodged only by Gender, so all rows in the same age group / Gender combination are drawn over one another starting from zero, and the height of the visible bar is the maximum Impressions value for that combination. You can think of it as every observation in the same age group / Gender combination forming a long queue, such that from the front, you only see one observation.
In order to plot the stack of values dodged by gender, you can calculate the summarised values (the total Impressions per age group / Gender combination) first and then plot those totals with position = "dodge".
Sample data used for illustration:
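A minimal sketch of the summarise-then-dodge approach, assuming a small made-up data frame df whose values are invented here (they are not taken from ny1) and using base R's aggregate() to compute the totals. Any grouping tool, such as dplyr, would work equally well. The names df, df_sum and df_max are hypothetical.

```r
library(ggplot2)

# Hypothetical sample data (values invented for illustration); it mimics the
# age_group / Gender / Impressions columns of ny1.
df <- data.frame(
  age_group   = factor(c("(19,29]", "(19,29]", "(19,29]",
                         "(29,39]", "(29,39]", "(29,39]")),
  Gender      = factor(c("0", "0", "1", "0", "1", "1")),
  Impressions = c(3, 5, 4, 2, 6, 1)
)

# One row per age_group / Gender combination, holding the total Impressions.
df_sum <- aggregate(Impressions ~ age_group + Gender, data = df, FUN = sum)

# For comparison only: the per-combination maximum, which is the height that
# dodging the raw rows ends up showing.
df_max <- aggregate(Impressions ~ age_group + Gender, data = df, FUN = max)

# Dodged bars of the totals, one bar per age_group / Gender combination.
ggplot(df_sum, aes(x = age_group, y = Impressions, fill = Gender)) +
  geom_col(position = "dodge")
```

With this toy data, the totals in df_sum differ from the maxima in df_max wherever a combination has more than one row, which is the same gap seen between the stacked and dodged plots of the raw data.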
Side note: geom_col() is equivalent to geom_bar(stat = "identity"), so you can use that instead as well.