Add shaded standard error curves to geom_density in ggplot2

2.8k views Asked by At

I would like to add shaded standard error curves to geom_density using ggplot2. My code looks like this:

data.plot <- data.frame(x = c(rnorm(100, mean = 0, sd = 5),
                                    rnorm(100, mean = 1, sd =2 )), 
                             g = factor(c(rep(1, 100),
                                        rep(2,100))))

ggplot(data.plot, aes(x, linetype = g)) + geom_density()

I couldn't find a tutorial or examples to do so. Thank you.

1

There are 1 answers

0
Brian On BEST ANSWER

The best solution is with bootstrapping, as mentioned in the comments. I'll use the classic iris data, focusing on the density of Sepal.Length for each Species.

Generate samples

library(dplyr)
library(broom)
library(tidyr)
library(ggplot2)


data_frame(bs = 1:1000) %>% group_by(bs) %>% 
  mutate(data = list(iris %>% group_by(Species) %>% 
                       sample_frac(size = 1, replace = T)))
# A tibble: 1,000 x 2
# Groups:   bs [1,000]
      bs               data
   <int>             <list>
 1     1 <tibble [150 x 5]>
 2     2 <tibble [150 x 5]>
 3     3 <tibble [150 x 5]>
 4     4 <tibble [150 x 5]>
 5     5 <tibble [150 x 5]>
 6     6 <tibble [150 x 5]>
 7     7 <tibble [150 x 5]>
 8     8 <tibble [150 x 5]>
 9     9 <tibble [150 x 5]>
10    10 <tibble [150 x 5]>
# ... with 990 more rows

So I just made 1000 bootstrap replicates of the original data, taking the same number of rows out of each group as its original sample size, with replacement. Now I have to unnest to access the data in the nested data column.

Compute within-sample density

densities.within <-
data_frame(bs = 1:1000) %>% group_by(bs) %>% 
  mutate(data = list(iris %>% group_by(Species) %>% 
                       sample_frac(size = 1, replace = T))) %>% 
  unnest() %>% 
  group_by(bs, Species) %>% 
  do(tidy(density(.$Sepal.Length, 
                  from = min(iris$Sepal.Length), 
                  to = max(iris$Sepal.Length), 
                  n = 128)))
# A tibble: 384,000 x 4
# Groups:   bs, Species [30,000]
      bs Species        x         y
   <int>  <fctr>    <dbl>     <dbl>
 1     1  setosa 4.300000 0.2395786
 2     1  setosa 4.328346 0.2821128
 3     1  setosa 4.356693 0.3235939
 4     1  setosa 4.385039 0.3632449
 5     1  setosa 4.413386 0.4010378
 6     1  setosa 4.441732 0.4375189
 7     1  setosa 4.470079 0.4734727
 8     1  setosa 4.498425 0.5095333
 9     1  setosa 4.526772 0.5459280
10     1  setosa 4.555118 0.5824587
# ... with 383,990 more rows

So we expanded the data to its long form, then took the density of each group's Sepal.Length within Species within bs. We had to supply a manual from = and to = since the min and max within each bootstrap might differ (and set a lower n = than the default 512 just to save time). To simplify the S3: density object that's generated, we use broom::tidy. This is the computationally intensive step, so we'll save this object as densities.within.

Summarize densities into quantiles

This results in columns named x and y, but we'll rename these to match our data. Then we'll figure out: for the densities calculated at each computed possible Sepal.Length, what is the lower end of the CI, the median, and the upper end of the CI? We'll use quantile to get these specific values of the computed densities.

densities.qtiles <-
densities.within %>%
  rename(Sepal.Length = x, dens = y) %>%
  ungroup() %>%
  group_by(Species, Sepal.Length) %>% 
  summarise(q05 = quantile(dens, 0.025),
            q50 = quantile(dens, 0.5),
            q95 = quantile(dens, 0.975)) 
# A tibble: 384 x 5
# Groups:   Species [?]
   Species Sepal.Length        q05       q50       q95
    <fctr>        <dbl>      <dbl>     <dbl>     <dbl>
 1  setosa     4.300000 0.05730022 0.2355335 0.4426299
 2  setosa     4.328346 0.08177850 0.2734463 0.4970097
 3  setosa     4.356693 0.09863062 0.3114570 0.5505578
 4  setosa     4.385039 0.12459033 0.3430645 0.5884523
 5  setosa     4.413386 0.15049699 0.3705389 0.6207344
 6  setosa     4.441732 0.17494889 0.4006335 0.6418923
 7  setosa     4.470079 0.19836510 0.4258006 0.6655006
 8  setosa     4.498425 0.21106857 0.4555755 0.6971370
 9  setosa     4.526772 0.23399070 0.4813130 0.7244413
10  setosa     4.555118 0.24863090 0.5108057 0.7708114
# ... with 374 more rows

Visualize and compare

ggplot(densities.qtiles, aes(Sepal.Length, q50)) +
  facet_wrap(~Species, nrow = 2) +
  geom_histogram(data = iris, 
                 aes(Sepal.Length, ..density..), 
                 colour = "black", fill = "white", 
                 binwidth = 0.25, boundary = 0) +
  geom_ribbon(aes(ymin = q05, ymax = q95), alpha = 0.5, fill = "grey50") +
  stat_density(data = iris, 
               aes(Sepal.Length, ..density.., color = "raw density"), 
               size = 2, geom = "line") +
  geom_line(size = 1.5, aes(color = "bootstrapped")) + 
  scale_color_manual(values = c("red", "black")) +
  labs(y = "density") +
  theme(legend.position = c(0.5,0),
        legend.justification = c(0,0))

enter image description here

I included histogram and density layers for the original data as well for comparison. You can see that the median and raw densities are very close with 1000 bootstrap samples.