I'm trying to partition my data into equal-sized bins, calculate the mean value of wdi_expedu for each bin, then overlay a horizontal segment of bin-specific means on a scatterplot of wdi_expedu and left_seats.
I used this code to add a variable to my dataset that breaks it into 4 bins, then created a new data frame with the group means for each bin.
j <- 4
data_selected_na <- data_selected_na %>%
  mutate(bin = as.integer(cut_number(wdi_expedu, j)))
bin_means <- data_selected_na %>%
  group_by(bin) %>%
  summarise(mean_wdi_expedu = mean(wdi_expedu))
I tried to use this code to create my scatterplot, but the segments span the whole width of the graph. How do I limit the x-values of the segments so they only span the width of the corresponding bin (i.e., all the x-values in bin 1 are spanned by the segment for the group mean of bin 1)? I don't want to code the x-limit coordinates by hand, since I will eventually scale the j value up to 100.
ggplot(data_selected_na, aes(x = left_seats, y = wdi_expedu)) +
  geom_point() +
  geom_segment(data = bin_means, aes(x = min(data_selected_na$left_seats), xend = max(data_selected_na$left_seats), y = bin_means$mean_wdi_expedu, yend = bin_means$mean_wdi_expedu), color = "red") +
  labs(title = "Scatterplot with Horizontal Lines for j = 4", x = "Left Seats", y = "Education Spending") +
  theme_minimal()
The issue is that you use the min and max values of left_seats for each of your segments. Instead, if you want a segment per bin, you have to use the lower and upper boundary values of the bins. To this end, stick with the default labels returned by cut() (aside: if you want integer codes you could set labels = FALSE), then extract the lower and upper boundaries into separate columns using e.g. tidyr::separate_wider_regex.
Finally, you computed the mean of wdi_expedu per bin of wdi_expedu. Instead, as you want to compare wdi_expedu by left_seats, you probably want to bin by left_seats.
Using some fake random example data:
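A minimal sketch of the idea, and not necessarily the exact approach described above: rather than parsing the interval labels with tidyr::separate_wider_regex, it takes each bin's observed minimum and maximum of left_seats as the segment limits. The data frame dat, the seed, and the column names x_min and x_max are made up for illustration; only left_seats, wdi_expedu, and j come from the question.

```r
library(dplyr)
library(ggplot2)

set.seed(123)

# Fake data standing in for data_selected_na (values are invented)
dat <- data.frame(
  left_seats = runif(200, min = 0, max = 100),
  wdi_expedu = rnorm(200, mean = 5, sd = 1)
)

j <- 4  # number of bins; can be scaled up without hand-coding limits

# Bin by the x variable (left_seats), then summarise each bin:
# its x-range and the mean of wdi_expedu within it
bin_means <- dat %>%
  mutate(bin = cut_number(left_seats, j, labels = FALSE)) %>%
  group_by(bin) %>%
  summarise(
    x_min = min(left_seats),
    x_max = max(left_seats),
    mean_wdi_expedu = mean(wdi_expedu),
    .groups = "drop"
  )

# One horizontal segment per bin, limited to that bin's x-range
p <- ggplot(dat, aes(x = left_seats, y = wdi_expedu)) +
  geom_point() +
  geom_segment(
    data = bin_means,
    aes(x = x_min, xend = x_max, y = mean_wdi_expedu, yend = mean_wdi_expedu),
    color = "red"
  ) +
  labs(
    title = paste("Scatterplot with Horizontal Lines for j =", j),
    x = "Left Seats",
    y = "Education Spending"
  ) +
  theme_minimal()

p
```

Because the limits are the observed extremes within each bin, small gaps appear between neighbouring segments. If the segments should meet exactly at the bin edges, the break points from the cut() labels are the better source for the limits.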