R-plot of conditional distribution. cdplot() doesn't seem to do it

1.8k views Asked by At

I have a data set of the following structure:

> data("household", package="HSAUR2")
> household[c(1,5,10,30,40),]
   housing food goods service gender total
1      820  114   183     154 female  1271
5      721   83   176     104 female  1084
10     845   64  1935     414 female  3258
30    1641  440  6471    2063   male 10615
40    1524  964  1739    1410   male  5637

The column "total" is the sum of the first four columns. It's a household's expenditure grouped in four categories.

Now, if I wanted a conditional density plot of gender vs. total expenditure, I can go:

cdplot(gender ~ total, data=household)

And I will get this image:

enter image description here

I'd like the same picture with "total" expenditure on the x-axis, but the conditional distribution over the four classes (housing, food, goods, service) on the y-axis. I can only think of a very dirty hack where I generate a factor, and, for the first data line, I repeat "housing" 820 times, then "food" for 114 times, etc.

There has to be an easier way, right?

1

There are 1 answers

0
Thomas On BEST ANSWER

As I said, you're using the wrong tool to obtain what you want. You're envisioning a plot that cannot be obtained directly from your data (see bottom).

Instead, you need to model your data. Specifically, you want to predict the expected portion of expenditures in each category as a function of total expenditures. Then, the plot you're envisioning is shows the fitted values of that model (i.e., predicted proportion of expenditure in any area) as a function of total expenditures. Here's some code that does that using loess curves. I plot the raw data and the fitted values, to show you what is going on.

# setup the data
data("household", package = "HSAUR2")
household$total <- rowSums(household[,1:4])
household <- within(household, {
    housing <- housing/total
    food <- food/total
    goods <- goods/total
    service <- service/total
})

# estimate loess curves
l_list <-
list(loess(housing ~ total, data = household),
     loess(food ~ total, data = household),
     loess(goods ~ total, data = household),
     loess(service ~ total, data = household))

# stack fitted curves on top of one another
ndat <- data.frame(total = seq(min(household$total), max(household$total), 100))
p <- lapply(l_list, predict, newdata = ndat)
for(i in 2:length(l_list))
    p[[i]] <- p[[i]] + p[[i-1]]

# plot
plot(NA, xlim=range(household$total), ylim = c(0,1), xlab='Total', ylab='Percent', las=1, xaxs='i')
# plot dots
with(household, points(total, housing, pch = 20, col = palette()[1]))
with(household, points(total, housing + food, pch = 20, col = palette()[2]))
with(household, points(total, housing + food + goods, pch = 20, col = palette()[3]))
with(household, points(total, housing + food + goods + service, pch = 20, col = palette()[4]))
# plot fitted lines
for(i in 1:length(p))
    lines(ndat$total, p[[i]], type = 'l', lwd = 2, col = palette()[i])

The result:

enter image description here

If you tried to create a plot like this based on the raw data, it would look somewhat strange, but maybe that's what you're going for:

plot(NA, xlim=range(household$total), ylim = c(0,1), xlab='Total', ylab='Percent', las=1, xaxs='i')
with(household, lines(total[order(total)], housing[order(total)], pch = 20, col = palette()[1]))
with(household, lines(total[order(total)], (housing + food)[order(total)], pch = 20, col = palette()[2]))
with(household, lines(total[order(total)], (housing + food + goods)[order(total)], pch = 20, col = palette()[3]))
with(household, lines(total[order(total)], (housing + food + goods + service)[order(total)], pch = 20, col = palette()[4]))

Result:

enter image description here