How do I generalize this data in rate per 10,000 to get an estimate of which state has maximum occurrence of 'YES'

96 views Asked by At

I have an ordinal dataset for different states. Since every state has different populations and participants. How to generalize them in a proper rate to identify maximum occurrence per state. For Example, people who smoke/10,000 population in R

            sitecode   AK   AL   AR  AZB   CA   CO   DE   FL   HI   IA   IL   KS   KY   MI   MO   MS   MT   NE   NJ   NV   NY   OK   PA   RI   SC   SD   TN   WI   WV   WY
class sex                                                                                                                                                                  
No    female          1591 2324 3948 1784  555  885 6853 2363  383 1633 3787 1840 2305 9653 3373 4215 8966 2738  655 1705 6204 3951 1602 3448 2745  527 3858  767 4092 6414
      male            1425 1951 3243 1630  544  847 6117 2115  338 1419 3193 1712 2013 8548 3030 3300 8143 2424  529 1505 5513 3643 1517 2880 2031  459 3436  654 3624 5774
Yes   female           146  223  508  304   68   83  615  297   63  155  394  154  233  988  324  405  903  281   66  219  650  381  148  320  304   51  410   87  437  704
      male              73   98  167   92   22   37  255   92   18   52  175   74  121  444  136  140  377   98   29   78  218  166   49  116  101   11  147   22  179  309
1

There are 1 answers

0
Jon Spring On

I would approach this by reshaping this data into long form (eg 3 columns: class, sex, state; and ~400 rows, for each combination), then join to a separate table with state populations (preferably including the same state abbreviations), and then normalize each number by state population.

Here's an example.

library(tidyverse)
df %>%
  gather(state, count, AK:WY) %>%   # easier to work with if we gather into long form
  left_join(                        # left_join will pull in matching data for each row
    state.x77 %>%                   #   ...using data included in R's base datasets,
      as.data.frame() %>%           #     updated in this case to just show state abbrev
      rownames_to_column() %>%      #     and a population number
      mutate(state = state.abb) %>%
      select(state, Population)
  ) %>%
  mutate(rate_10k = count / (Population * 10)) %>%    # Find rate per 10k
  filter(class == "Yes") %>%  # Only include the "Yes" answers

  ggplot(aes(state %>% fct_reorder(rate_10k), 
             rate_10k, 
             color = sex)) + 
  geom_point() +
  coord_flip() +
  labs(x = "Rate per 10,000 population",
       y = "State")

enter image description here