Add shaded standard error curves to geom_density in ggplot2

Question

Add shaded standard error curves to geom_density in ggplot2

2.9k views Asked by Andrew At 06 January 2025 at 00:29

I would like to add shaded standard error curves to geom_density using ggplot2. My code looks like this:

data.plot <- data.frame(x = c(rnorm(100, mean = 0, sd = 5),
                                    rnorm(100, mean = 1, sd =2 )), 
                             g = factor(c(rep(1, 100),
                                        rep(2,100))))

ggplot(data.plot, aes(x, linetype = g)) + geom_density()

I couldn't find a tutorial or examples to do so. Thank you.

Original Q&A

There are 1 answers

**Brian** · Accepted Answer · 2017-08-29T00:37:35+00:00

The best solution is with bootstrapping, as mentioned in the comments. I'll use the classic iris data, focusing on the density of Sepal.Length for each Species.

Generate samples

library(dplyr)
library(broom)
library(tidyr)
library(ggplot2)


data_frame(bs = 1:1000) %>% group_by(bs) %>% 
  mutate(data = list(iris %>% group_by(Species) %>% 
                       sample_frac(size = 1, replace = T)))

# A tibble: 1,000 x 2
# Groups:   bs [1,000]
      bs               data
   <int>             <list>
 1     1 <tibble [150 x 5]>
 2     2 <tibble [150 x 5]>
 3     3 <tibble [150 x 5]>
 4     4 <tibble [150 x 5]>
 5     5 <tibble [150 x 5]>
 6     6 <tibble [150 x 5]>
 7     7 <tibble [150 x 5]>
 8     8 <tibble [150 x 5]>
 9     9 <tibble [150 x 5]>
10    10 <tibble [150 x 5]>
# ... with 990 more rows

So I just made 1000 bootstrap replicates of the original data, taking the same number of rows out of each group as its original sample size, with replacement. Now I have to unnest to access the data in the nested data column.

Compute within-sample density

densities.within <-
data_frame(bs = 1:1000) %>% group_by(bs) %>% 
  mutate(data = list(iris %>% group_by(Species) %>% 
                       sample_frac(size = 1, replace = T))) %>% 
  unnest() %>% 
  group_by(bs, Species) %>% 
  do(tidy(density(.$Sepal.Length, 
                  from = min(iris$Sepal.Length), 
                  to = max(iris$Sepal.Length), 
                  n = 128)))

# A tibble: 384,000 x 4
# Groups:   bs, Species [30,000]
      bs Species        x         y
   <int>  <fctr>    <dbl>     <dbl>
 1     1  setosa 4.300000 0.2395786
 2     1  setosa 4.328346 0.2821128
 3     1  setosa 4.356693 0.3235939
 4     1  setosa 4.385039 0.3632449
 5     1  setosa 4.413386 0.4010378
 6     1  setosa 4.441732 0.4375189
 7     1  setosa 4.470079 0.4734727
 8     1  setosa 4.498425 0.5095333
 9     1  setosa 4.526772 0.5459280
10     1  setosa 4.555118 0.5824587
# ... with 383,990 more rows

So we expanded the data to its long form, then took the density of each group's Sepal.Length within Species within bs. We had to supply a manual from = and to = since the min and max within each bootstrap might differ (and set a lower n = than the default 512 just to save time). To simplify the S3: density object that's generated, we use broom::tidy. This is the computationally intensive step, so we'll save this object as densities.within.

Summarize densities into quantiles

This results in columns named x and y, but we'll rename these to match our data. Then we'll figure out: for the densities calculated at each computed possible Sepal.Length, what is the lower end of the CI, the median, and the upper end of the CI? We'll use quantile to get these specific values of the computed densities.

densities.qtiles <-
densities.within %>%
  rename(Sepal.Length = x, dens = y) %>%
  ungroup() %>%
  group_by(Species, Sepal.Length) %>% 
  summarise(q05 = quantile(dens, 0.025),
            q50 = quantile(dens, 0.5),
            q95 = quantile(dens, 0.975))

# A tibble: 384 x 5
# Groups:   Species [?]
   Species Sepal.Length        q05       q50       q95
    <fctr>        <dbl>      <dbl>     <dbl>     <dbl>
 1  setosa     4.300000 0.05730022 0.2355335 0.4426299
 2  setosa     4.328346 0.08177850 0.2734463 0.4970097
 3  setosa     4.356693 0.09863062 0.3114570 0.5505578
 4  setosa     4.385039 0.12459033 0.3430645 0.5884523
 5  setosa     4.413386 0.15049699 0.3705389 0.6207344
 6  setosa     4.441732 0.17494889 0.4006335 0.6418923
 7  setosa     4.470079 0.19836510 0.4258006 0.6655006
 8  setosa     4.498425 0.21106857 0.4555755 0.6971370
 9  setosa     4.526772 0.23399070 0.4813130 0.7244413
10  setosa     4.555118 0.24863090 0.5108057 0.7708114
# ... with 374 more rows

Visualize and compare

ggplot(densities.qtiles, aes(Sepal.Length, q50)) +
  facet_wrap(~Species, nrow = 2) +
  geom_histogram(data = iris, 
                 aes(Sepal.Length, ..density..), 
                 colour = "black", fill = "white", 
                 binwidth = 0.25, boundary = 0) +
  geom_ribbon(aes(ymin = q05, ymax = q95), alpha = 0.5, fill = "grey50") +
  stat_density(data = iris, 
               aes(Sepal.Length, ..density.., color = "raw density"), 
               size = 2, geom = "line") +
  geom_line(size = 1.5, aes(color = "bootstrapped")) + 
  scale_color_manual(values = c("red", "black")) +
  labs(y = "density") +
  theme(legend.position = c(0.5,0),
        legend.justification = c(0,0))

I included histogram and density layers for the original data as well for comparison. You can see that the median and raw densities are very close with 1000 bootstrap samples.

TechQA.

Add shaded standard error curves to geom_density in ggplot2

There are 1 answers

Generate samples

Compute within-sample density

Summarize densities into quantiles

Visualize and compare

Related Questions in R

Related Questions in GGPLOT2

Related Questions in PROBABILITY-DENSITY

Related Questions in STANDARD-ERROR

Popular Questions

Popular Tags

Trending Questions