I just started working with R not long ago, and I am currently trying to strengthen my visualization skills. What I want to do is create boxplots with mean diamonds as a layer on top (see picture in the link below). I did not find any function that does this already, so I guess I have to create it myself.
What I was hoping to do was to create a geom or a stat that would allow something like this to work:
ggplot(data, aes(...)) +
geom_boxplot(...) +
geom_meanDiamonds(...)
I have no idea where to start in order to build this new function. I know which values are needed for the mean diamonds (mean and confidence interval), but I do not know how to build the geom / stat that takes the data from ggplot(), calculates the mean and CI for each group, and plots a mean diamond on top of each boxplot.
I have searched for detailed descriptions of how to build these types of functions from scratch, but I have not found anything that really starts from the bottom. I would really appreciate it if anyone could point me towards some useful guides.
Thank you!
I'm currently learning to write geoms myself, so this is going to be a rather long & rambling post as I go through my thought processes, untangling the Geom aspects (creating polygons & line segments) from the Stats aspects (calculating where these polygons & segments should be) of a geom.
Disclaimer: I'm not familiar with this kind of plot, and Google didn't throw up many authoritative guides. My understanding of how the confidence interval is calculated / used here may be off.
Step 0. Understand the relationship between a geom / stat and a layer function.
geom_boxplot and stat_boxplot are examples of layer functions. If you enter them into the R console, you'll see that they are (relatively) short, and do not contain the actual code for calculating the box / whiskers of the boxplot. Instead, geom_boxplot contains a line that says geom = GeomBoxplot, while stat_boxplot contains a line that says stat = StatBoxplot (reproduced below).
GeomBoxplot and StatBoxplot are ggproto objects. They are where the magic happens.
Step 1. Recognise that ggproto()'s _inherit parameter is your friend.
Don't reinvent the wheel. Since we want to create something that overlaps nicely with a boxplot, we can take reference from the Geom / Stat used for that, and only change what's necessary.
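To make Steps 0 and 1 concrete, here is a minimal sketch (assuming ggplot2 is installed). It shows that the layer returned by geom_boxplot() carries the GeomBoxplot and StatBoxplot ggproto objects, and that a child object created through `_inherit` keeps everything it does not override. GeomDemo is a hypothetical name used only for illustration.

```r
library(ggplot2)

# A layer function returns a layer object that holds the ggproto Geom and Stat.
lyr <- geom_boxplot()
class(lyr$geom)
class(lyr$stat)

# Hypothetical child object: inherits from GeomBoxplot and overrides one field.
GeomDemo <- ggproto(
  "GeomDemo",
  `_inherit` = GeomBoxplot,
  required_aes = c("x", "lower", "upper", "mean")
)

inherits(GeomDemo, "GeomBoxplot")  # the parent class is kept in the class chain
GeomDemo$required_aes              # the overridden field
```

Anything not listed in the ggproto() call (setup_data, draw_key, default_aes and so on) is looked up in the parent, which is why only the parts that differ need to be written.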
Step 2. Modify the Stat.
There are 3 functions defined within StatBoxplot: setup_data, setup_params, and compute_group. You can refer to the code on GitHub (link above) for the details, or view them by entering, for example, StatBoxplot$compute_group.
The compute_group function calculates the ymin / lower / middle / upper / ymax values for all the y values associated with each group (i.e. each unique x value), which are used to plot the box plot. We can override it with one that calculates the confidence interval & mean values instead:
(Optional) StatBoxplot has provision for the user to include weight as an aesthetic mapping. We can allow for that as well, by replacing:
with:
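As a rough illustration of the optional weight handling, the sketch below computes the three quantities a confidence interval needs (mean, standard deviation, sample size) with or without weights. It uses base R only. weighted_stats is a hypothetical helper, and treating the weights as frequency weights (a weight of 2 counts as two observations) is my assumption, not something the boxplot code prescribes.

```r
# Hypothetical helper: mean, SD and sample size, optionally weighted.
# Assumption: weights are frequency weights, so n is the sum of the weights.
weighted_stats <- function(y, w = NULL) {
  if (is.null(w)) w <- rep(1, length(y))  # no weights: every point counts once
  keep <- !is.na(y) & !is.na(w)           # drop incomplete pairs
  y <- y[keep]
  w <- w[keep]
  n <- sum(w)
  a <- sum(w * y) / n                     # weighted mean
  s <- sqrt(sum(w * (y - a)^2) / (n - 1)) # weighted SD (frequency weights)
  list(mean = a, sd = s, n = n)
}

# With frequency weights, c(1, 4) weighted by c(2, 1) behaves like c(1, 1, 4).
weighted_stats(c(1, 4), w = c(2, 1))
weighted_stats(c(1, 1, 4))
```

Inside compute_group, the weights would come from data$weight when that column is present.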
There's no need to change the other functions in StatBoxplot. So we can define StatMeanDiamonds as follows:
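A minimal sketch of what such a Stat could look like, assuming ggplot2 is installed and a t-based confidence interval is wanted (see the disclaimer above on how the interval is calculated). mean_ci is a hypothetical helper name, the sketch ignores weights, and it only handles the default vertical orientation.

```r
library(ggplot2)

# Hypothetical helper: mean and t-based confidence interval for one group.
mean_ci <- function(y, ci = 0.95) {
  y <- y[!is.na(y)]
  n <- length(y)
  m <- mean(y)
  half_width <- qt(1 - (1 - ci) / 2, df = n - 1) * sd(y) / sqrt(n)
  c(lower = m - half_width, mean = m, upper = m + half_width)
}

StatMeanDiamonds <- ggproto(
  "StatMeanDiamonds",
  `_inherit` = StatBoxplot,
  # Only compute_group is overridden; setup_data / setup_params are inherited.
  compute_group = function(data, scales, width = NULL, ci = 0.95,
                           na.rm = FALSE) {
    df <- as.data.frame(as.list(mean_ci(data$y, ci)))
    # One x position per group, so the diamond is centred like the box.
    df$x <- if (is.factor(data$x)) data$x[1] else mean(range(data$x))
    # If the group spans several x values, use 90% of that spread as width.
    if (length(unique(data$x)) > 1) width <- diff(range(data$x)) * 0.9
    df$width <- width
    df
  }
)

# Quick check of one group, calling the method directly:
StatMeanDiamonds$compute_group(
  data.frame(x = 1, y = c(1, 2, 3, 4, 5)),
  scales = NULL, width = 0.75
)
```

The result is a one-row data frame per group with the columns "lower" / "mean" / "upper" plus x and width, which is what the Geom in the next step expects.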
Step 3. Modify the Geom.
GeomBoxplot has 3 functions: setup_data, draw_group, and draw_key. It also includes definitions for default_aes and required_aes.
Since we've changed the upstream data source (the data produced by StatMeanDiamonds contain the calculated columns "lower" / "mean" / "upper", while the data produced by StatBoxplot would have contained the calculated columns "ymin" / "lower" / "middle" / "upper" / "ymax"), do check whether the downstream setup_data function is affected as well. (In this case, GeomBoxplot$setup_data makes no reference to the affected columns, so no changes are required here.)
The draw_group function takes the data produced by StatMeanDiamonds and set up by setup_data, and produces multiple data frames: "common" contains the aesthetic mappings common to all geoms, "diamond.df" contains the mappings that contribute towards the diamond polygon, and "segment.df" contains the mappings that contribute towards the horizontal line segment at the mean. The data frames are then passed to the draw_panel functions of GeomPolygon and GeomSegment respectively, to produce the actual polygons / line segments.
The draw_key function is used to create the legend for this layer, should the need arise. Since GeomMeanDiamonds inherits from GeomBoxplot, the default is draw_key = draw_key_boxplot, and we don't have to change it. Leaving it unchanged will not break the code. However, I think a simpler legend such as draw_key_polygon offers a less cluttered look.
GeomBoxplot's default_aes specifications look fine. But we need to change the required_aes, since the data we expect to get from StatMeanDiamonds is different ("lower" / "mean" / "upper" instead of "ymin" / "lower" / "middle" / "upper" / "ymax").
We are now ready to define GeomMeanDiamonds:
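The description above could translate into something like the following sketch. It assumes ggplot2 is installed and that the inherited GeomBoxplot$setup_data has already added xmin / xmax to the data. The line width aesthetic is named linewidth in newer ggplot2 releases and size in older ones, so the sketch passes both, which is a defensive assumption on my part.

```r
library(ggplot2)

GeomMeanDiamonds <- ggproto(
  "GeomMeanDiamonds",
  `_inherit` = GeomBoxplot,
  draw_group = function(data, panel_params, coord) {
    # Prefer `linewidth`; fall back to `size` on older ggplot2 versions.
    lw <- if (is.null(data$linewidth)) data$size else data$linewidth
    # Aesthetics shared by the polygon and the segment.
    common <- data.frame(
      colour = data$colour, linewidth = lw, size = lw,
      linetype = data$linetype, fill = data$fill, group = data$group,
      stringsAsFactors = FALSE
    )
    # Diamond corners: top (upper), right, bottom (lower), left.
    diamond.df <- data.frame(
      x = c(data$x, data$xmax, data$x, data$xmin),
      y = c(data$upper, data$mean, data$lower, data$mean),
      alpha = data$alpha, common,
      stringsAsFactors = FALSE
    )
    # Horizontal line across the diamond at the mean.
    segment.df <- data.frame(
      x = data$xmin, xend = data$xmax,
      y = data$mean, yend = data$mean,
      alpha = NA, common,
      stringsAsFactors = FALSE
    )
    grid::grobTree(
      GeomPolygon$draw_panel(diamond.df, panel_params, coord),
      GeomSegment$draw_panel(segment.df, panel_params, coord)
    )
  },
  draw_key = draw_key_polygon,
  required_aes = c("x", "lower", "upper", "mean")
)
```

Everything else (setup_data, default_aes) is left to GeomBoxplot through inheritance.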
Step 4. Define the layer functions.
This is the boring part. I copied from geom_boxplot / stat_boxplot directly, removing all references to outliers in geom_meanDiamonds, changing to geom = GeomMeanDiamonds / stat = StatMeanDiamonds, and adding ci = 0.95 to stat_meanDiamonds.
Step 5. Check output.
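Steps 4 and 5 can be sketched together as below. This is a trimmed-down version of the layer functions rather than a faithful copy of geom_boxplot / stat_boxplot, and it assumes that ggproto objects named GeomMeanDiamonds and StatMeanDiamonds (Steps 2 and 3) already exist in the session. The plot is only built when both are found. The iris data set and the red outline are arbitrary choices for the check.

```r
library(ggplot2)

geom_meanDiamonds <- function(mapping = NULL, data = NULL,
                              stat = "meanDiamonds", position = "dodge2",
                              ..., na.rm = FALSE, show.legend = NA,
                              inherit.aes = TRUE) {
  layer(
    data = data, mapping = mapping, stat = stat,
    geom = GeomMeanDiamonds,  # looked up when the function is called
    position = position, show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

stat_meanDiamonds <- function(mapping = NULL, data = NULL,
                              geom = "meanDiamonds", position = "dodge2",
                              ..., ci = 0.95, na.rm = FALSE,
                              show.legend = NA, inherit.aes = TRUE) {
  layer(
    data = data, mapping = mapping,
    stat = StatMeanDiamonds,  # looked up when the function is called
    geom = geom, position = position, show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ci = ci, ...)
  )
}

# Check output, but only if the Stat and Geom from Steps 2 and 3 are defined.
if (exists("GeomMeanDiamonds") && exists("StatMeanDiamonds")) {
  p <- ggplot(iris, aes(Species, Sepal.Length)) +
    geom_boxplot() +
    geom_meanDiamonds(fill = NA, colour = "red")
  print(p)
}
```

Passing ci through geom_meanDiamonds(ci = 0.99) should also work, since extra arguments in ... are forwarded to the stat's parameters.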