Plotting data in logarithmic scale using stat_summary: summary range reaches zero


Problem:

The ggplot2 documentation clearly states that scale_y_log10() and coord_trans(y="log10") behave differently: the first transforms the data (before summary statistics are computed), the second transforms the coordinates (after). As such, the former cannot be used when adding error bars to logarithmic plots, because the real scale is not linear and the errors must be calculated before the logarithmic transformation, not after.

However, when adding error bars (or other ranges) to data plotted on a logarithmic scale, there is a risk that part of the range will reach zero (or below) and the plot will fail. In this case the plotted variable cannot possibly go below 0, so these values could safely be cut at an arbitrarily low positive number to keep the plot from failing.
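As a minimal sketch of that clipping idea (the data frame summ and all values in it are invented for illustration), the lower bounds can be floored at a small positive number before plotting:

```r
library(ggplot2)

# Hypothetical precomputed summaries; one lower bound dips below zero
summ <- data.frame(
  x    = 1:3,
  y    = c(0.5, 0.05, 0.005),
  ymin = c(0.3, -0.01, 0.001),
  ymax = c(0.7, 0.11, 0.009)
)

# Floor the lower bounds at an arbitrarily low positive value so that
# coord_trans(y = "log10") has no non-positive values to choke on
summ$ymin <- pmax(summ$ymin, 1e-8)

ggplot(summ, aes(x, y, ymin = ymin, ymax = ymax)) +
  geom_pointrange() +
  coord_trans(y = "log10")
```

This is exactly the manual precomputation the question would rather avoid, but it shows why flooring the range is safe when the variable itself cannot be negative.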

Example:

Get an example dataframe:

url <- "https://copy.com/cfNBrcTLEfjiYXXa/d_dataframe.Rdata?download=1"
download.file(url, "d_dataframe.Rdata", method="curl")
load("d_dataframe.Rdata")
library(ggplot2)
p <- ggplot(d, aes(x=x, y=PDF))

See the data in log scale:

p + geom_point(aes(colour=model)) + scale_y_log10()
p + geom_point(aes(colour=model)) + coord_trans(y="log10")

All OK: the data looks the same with coord_trans() and scale_y_log10(). However, if I add a summary statistic:

p + stat_summary(fun.data = "mean_se", fun.args = list(mult = 2)) + scale_y_log10()

The above plots, but does not correctly represent the actual standard error (the summary is computed on the log-transformed data), while:

p + stat_summary(fun.data = "mean_se", fun.args = list(mult = 2)) + coord_trans(y="log10")

cannot be plotted, because one of the summaries (in this case I used standard errors) reaches 0 (or less) and the logarithmic scale fails with the usual:

Error in if (zero_range(range)) { : missing value where TRUE/FALSE needed

Setting a y limit does not work, because some of the pointranges are simply removed:

p + stat_summary(fun.data = "mean_se", fun.args = list(mult = 2)) + coord_trans(y="log10", ylim=c(1e-10, 1.1))

I'd like these lines not to be removed, but simply to reach the bottom of the graph, indicating that they extend to zero or below.

I know I could manually calculate the errors, filter them to be >0, and plot them with geom_linerange() or similar. But stat_summary() is really convenient and fast, and makes it easy to quickly change the plotted statistic. The same consideration applies to geom_smooth().

Is there any way I can retain the use of stat_summary()/geom_smooth() and still plot data on a logarithmic scale?

EDIT:

An example solution would be something along these lines (for a 2sd spread):

maxf <- function(dati) {mean(dati) + 2*sd(dati)}
minf <- function(dati) {max(1e-8, mean(dati) - 2*sd(dati))}

And then using:

stat_summary(fun = mean, fun.min = minf, fun.max = maxf) + coord_trans(y="log10")

However, this has to be adjusted every time I change my mind on the summary error (e.g. if I choose standard error instead of standard deviation), and it is impossible (?) to use this approach for geom_smooth().

Answer by Curt F.:

A simple and reasonably workable solution is to define maxf and minf as function factories: functions that take a summary function as their argument and return the function that stat_summary() needs.

# Defining minf and maxf to take functions as input
minf <- function(f){function(dati) {max(1e-8, mean(dati) - f(dati))}}
maxf <- function(f){function(dati) {mean(dati) + f(dati)}}

# A plot that works (the + must end the line so R continues the expression)
p + stat_summary(fun = mean, fun.min = minf(sd), fun.max = maxf(sd)) +
  coord_trans(y="log10")
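As a quick sanity check (the input vector here is invented), minf(sd) really does return a plain function of the data, which is what stat_summary() expects:

```r
# Same factory as above: takes a summary function f, returns a function
# of the data that computes mean - f, floored at a small positive value
minf <- function(f){function(dati) {max(1e-8, mean(dati) - f(dati))}}

lower <- minf(sd)
lower(c(1, 2, 3))  # mean is 2 and sd is 1, so this returns 1
```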

# If you change your mind and want something besides sd
se <- function(x){sd(x)/sqrt(length(x))}

p + stat_summary(fun = mean, fun.min = minf(se), fun.max = maxf(se)) +
  coord_trans(y="log10")

There are some limitations to this approach: you still have to redefine minf and maxf if you want error bars that aren't some form of addition to or subtraction from the mean, and your f functions must accept a single vector input and return a single value. But I think it's a little nicer than having to redefine minf and maxf every time the summary changes.
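For example (a hypothetical case, not from the answer), a multiplicative interval around the geometric mean is not of the form mean(dati) ± f(dati), so it would still need its own pair of functions:

```r
# Geometric mean with a multiplicative (log-scale) spread; assumes
# strictly positive data, which is the case on a log axis anyway
gmin <- function(dati) exp(mean(log(dati)) - sd(log(dati)))
gmax <- function(dati) exp(mean(log(dati)) + sd(log(dati)))

# Usage would mirror the calls above:
# p + stat_summary(fun = mean, fun.min = gmin, fun.max = gmax) +
#   coord_trans(y = "log10")
```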