Interpreting y axis in density plot

406 views Asked by At

200 people were tested, 20 of those were infected. I want to get a posterior distribution for the uncertainty associated with the probability that a person is infected.

I do this like this:

n<-200
s<-20
p<-seq(0,0.3,0.001)

dp<-dbeta(p, s+1, n-s+1)

But then when I plot it, I don't know how to interpret the y axis and summary results:

plot(p, dp, type="l")

enter image description here

> summary(dp)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000000  0.000011  0.032438  3.322259  3.841204 18.820899 

So there is a 10% chance of.....something being 18.82? Or? What does this summary tell me?

Also, what is the difference between the first plot and the plot below?

plot(density(dp))

enter image description here

1

There are 1 answers

0
Markus_J On

So was basically answering the plot difference question and as pointed out not very precise. So lets try again:

First: The Result of Plot 1

A beta-distribution is bound between 0 and 1. The shape is influenced by a (in our example: 21) and b (in our example: 181). Given that shape, dbeta(x, shape1, shape2) is used to return probability density at a value x bounded between 0 and 1. For example dbeta returns 18.8209 (the maximum) for the value 0.1. Now given the data, we should expect that the most likely belief on the number of infected patients is 20/200 or 10 % of the sample. However, there is, for example, a moderate chance that the real percentage is 0.075 while it is very unlikely to be 0.3. As a result the beta-distribution puts mass around the x-value of 0.1 while it is practically 0 at a x-value of 0.9.

Second: The interpretation of the y-axis

Now best to ask a statisticion here (which i am not). However, the y-axis here describes the probability density. Now as a heuristic think of the probability density of giving weight to the values on the x-axis. So if you say draw from a uniform distribution every x-value is equally likely to be drawn (hence considered a uninformative prior). While if you would draw from a normal distribution x-values close to the mean are more likely to be drawn than x-values 5 standard deviations away from the mean. The area under the distribution curve, however, is 1 in both cases. (Hope this gives the picture; Thought this might help better than formulas).

Third: The results

So if you want to assess the distribution like you intended with the summary() command do the following: Use qbeta(0.5, 21, 181). This will provide you with the 50 % quantile; Which in our case is 0.1026. Thus, we can say that 50 % of the distribution is to the right of 0.1026 (on the x-axis). So there is a 50 % chance that the percentage of infected individuals within 200 randomly selected people is lager than 10.26 (20.52 people).

Lastly:

The density command tries to estimate the density function of given data. Kinda like making a histogram with very small breaks and drawing a curve based on the bar lengths (yes i know). Beyond that it also smoothes out and extends the boundaries of your data to approach zero. However, you are doing a density on a density here. So...