I am creating histograms of substitutions (1st, 2nd, or 3rd sub) over time, so each histogram shows the number of subs made in a given minute for a given sub number. The histograms make sense to me: with a bin width of 1 minute they are mostly smooth, and nothing looks out of the ordinary. However, when I overlay a density plot, the left tail of the density curve is inflated in one of the graphs, and I cannot determine why.

The dataset contains substitutions, ranging from minute 1 to a maximum time. I then cut the dataset in half to look only at subs made after minute 45. I have not folded this data back. I have tried to create a reproducible example, but cannot given the data.

Code used to create the plot in R:

## Filter out subs that are not in the second half
df.half<-df[df$PeriodId>=2,]
p<-ggplot(data=df.half, aes(x=time)) + 
  geom_histogram(aes(y=..density..),position="identity", alpha=0.5,binwidth=1)+
  geom_vline(data=sumy.df.half,aes(xintercept=grp.mean),color="blue", linetype="dashed", size=1)+
  geom_density(alpha=.2)+
  facet_grid(SUB_NUMBER ~ .)+
  scale_y_continuous(limits = c(0,0.075),breaks = c(seq(0,0.075,0.025)),
                     minor_breaks = c(seq(0,0.075,0.025)),name='Count')
p

Why, for the First Sub, is the density plot inflated in the tail if there are no values less than 45? Also, why isn't the density plot more inflated in the tail for the Second Sub?

Side note: I asked this question on Cross Validated, but was told that since it involves R, I should ask it here instead.


So I was able to change the code and get the following:

ggplot() + 
  geom_histogram(data=df.half, aes(x=time,y=..density..),position="identity", alpha=0.5,binwidth=1)+
  geom_density(data=df.half,aes(x=time,y=..density..))+
  geom_vline(data=sumy.df.half,aes(xintercept=grp.mean),color="blue", linetype="dashed", size=1)+
  facet_grid(SUB_NUMBER ~ .)


This looks more correct and at least now fits the dataset. However, I am still confused as to why those issues occurred in the first place.

1 Answer

Community On Best Solutions

While there is no data sample to reproduce the error, you could make sure that geom_density uses the correct data and aesthetics by specifying them explicitly. You can also try moving the geom_density line to just after geom_histogram. Also, the y-axis label is probably wrong: it is currently set to "Count", while the values suggest that it is in fact a density.
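To illustrate the point about the y-axis label, here is a minimal base R sketch of the difference between counts and densities for one-minute bins. The `time` vector is simulated and purely hypothetical, since the real df.half is not available.

```r
# Simulated stand-in for df.half$time (hypothetical values, not the real data)
set.seed(1)
time <- sample(46:95, size = 500, replace = TRUE)

# One-minute bins, matching binwidth = 1 in the plot
breaks <- seq(45, 95, by = 1)
h <- hist(time, breaks = breaks, plot = FALSE)

# Counts are whole numbers that add up to the number of rows
total_count <- sum(h$counts)

# Densities are count / (n * binwidth), so the bar areas add up to 1
total_area <- sum(h$density * diff(h$breaks))

c(total_count = total_count, total_area = total_area)
```

A y-axis that tops out around 0.075, as in the question, is consistent with densities of this kind rather than with counts.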

How would I specify density explicitly?

You can specify the density parameters explicitly by passing data and aes (and position, if needed) directly in the geom_density call, so that it uses these instead of the inherited arguments:

ggplot() + 
  geom_histogram(data=df.half, aes(x=time,y=..density..),position="identity", alpha=0.5,binwidth=1)+
  geom_density(data=df.half,aes(x=time,y=..density..))+
  geom_vline(data=sumy.df.half,aes(xintercept=grp.mean),color="blue", linetype="dashed", size=1)+
  facet_grid(SUB_NUMBER ~ .)
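For anyone who wants to try this without the original data, here is a self-contained sketch of the same layering. The data frame is simulated and hypothetical. The sketch assumes ggplot2 3.4.0 or later, where `after_stat(density)` is the current spelling of `..density..` and `linewidth` replaces `size` for lines.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical; the real df.half is not available)
set.seed(42)
df.half <- data.frame(
  time = sample(46:95, size = 600, replace = TRUE),
  SUB_NUMBER = rep(1:3, each = 200)
)

# Per-facet means for the dashed line, mirroring sumy.df.half
sumy.df.half <- aggregate(time ~ SUB_NUMBER, data = df.half, FUN = mean)
names(sumy.df.half)[names(sumy.df.half) == "time"] <- "grp.mean"

# Every layer names its own data and aesthetics, so nothing is inherited
p <- ggplot() +
  geom_histogram(data = df.half, aes(x = time, y = after_stat(density)),
                 position = "identity", alpha = 0.5, binwidth = 1) +
  geom_density(data = df.half, aes(x = time, y = after_stat(density))) +
  geom_vline(data = sumy.df.half, aes(xintercept = grp.mean),
             color = "blue", linetype = "dashed", linewidth = 1) +
  facet_grid(SUB_NUMBER ~ .) +
  labs(y = "Density")

# Inspect what the histogram and density layers actually computed
hist_data <- layer_data(p, 1)
dens_data <- layer_data(p, 2)
```

Printing `p` draws the plot, and looking at `hist_data` and `dens_data` is a quick way to check which values each layer ended up with.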

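I do not understand how it occurred in the first place

I think that in your initial code you explicitly specified only the alpha argument for geom_density. Every other parameter it needed (data, aes, position, etc.) was therefore inherited, and apparently the inherited values did not give the result you expected. Possibly it picked up the data argument from the geom_vline call (sumy.df.half), or was confused by the "..density.." syntax in the aesthetic. Note, though, that a layer normally inherits its data and aes from the ggplot() call rather than from a sibling layer, and that your revised code also drops the scale_y_continuous(limits = ...) call, so that change may be what actually made the difference.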
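One difference between the two versions in the question is that the revised code no longer calls scale_y_continuous(limits = c(0, 0.075)). By default, ggplot2 scale limits censor values outside the limits, so a histogram bar taller than 0.075 would be dropped with a warning rather than clipped, while the smoother density curve could stay within the limits. The base R sketch below shows how that could produce a curve that looks inflated next to the remaining bars. It rests on an assumption that is not confirmed by the question: that many first subs happen in a single minute right after minute 45. The data is simulated and hypothetical.

```r
# Simulated first-sub times (hypothetical): a spike of subs at minute 46
# plus subs spread over the rest of the half
set.seed(7)
time <- c(rep(46, 300), sample(47:90, size = 700, replace = TRUE))

# Histogram densities with one-minute bins, as with binwidth = 1
h <- hist(time, breaks = seq(45, 90, by = 1), plot = FALSE)
tallest_bar <- max(h$density)   # 300 of 1000 values fall in one bin

# Kernel density estimate with R's default Gaussian kernel and bandwidth
kde <- density(time)
kde_peak <- max(kde$y)

# Share of the estimated density that the kernel places below minute 46
mass_below_46 <- sum(kde$y[kde$x < 46]) * diff(kde$x[1:2])

upper_limit <- 0.075            # the y limit used in the original plot
c(tallest_bar = tallest_bar, kde_peak = kde_peak,
  mass_below_46 = mass_below_46)
```

In this simulation the tallest bar is well above the limit and the peak of the smoothed curve is below it, so only the bar would disappear. The kernel also spreads part of the spike below minute 46, even though no observation lies there. If the aim is only to zoom the y-axis without dropping anything, coord_cartesian(ylim = c(0, 0.075)) may be worth trying instead of scale limits.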