I have the following table in R, which lists each person's race, gender, age, and cholesterol test result. Age and cholesterol test are stored as dummy variables: age can be low, medium, or high, while the cholesterol test can be low or high. I want to collapse the age and cholesterol columns into single columns where low is coded as 1, medium as 2, and high as 3. The cholesterol test can be neither low nor high if a person never took one, and in that case it should be N/A in the expected output. I want the solution to be dynamic, so that the code still works if I have more columns in this format (i.e. there may be new tests, stored as dummy variables for low, medium, or high).
How can I do this in R?
input:
  race  gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
  <chr>  <int>       <int>          <int>        <int>             <int>              <int>
1 white      0           1              0            0                 0                  0
2 white      0           1              0            0                 0                  0
3 white      1           1              0            0                 0                  0
4 black      1           0              1            0                 0                  0
5 white      0           0              0            1                 0                  1
6 black      0           0              1            0                 1                  0
expected output:
  race  gender age chol_test
1 white      0   1       n/a
2 white      0   1       n/a
3 white      1   1       n/a
4 black      1   2       n/a
5 white      0   3         3
6 black      0   2         1
We could first define a custom function that allows us to recode dummy variables based on their variable names, below called var_nm2value. This function takes the values of the variables as the x argument; in dplyr::across this is the .x part. It also takes a list of name-value pairs as the value_ls argument. The function just loops over the list of name-value pairs and checks whether a name in value_ls is found in the variable name. To do this it uses grepl on dplyr::cur_column(). If we have a match, we replace all 1s with the value from our value_ls and return all other values, that is the zeros, as is.
Then we can define a list of recode values, below recode_ls.
Finally, we use purrr::map_dfc in a dplyr::summarise where we i) use the variable strings we want to create, "age" and "chol_test", then ii) select only columns which contain this string, and in each iteration we iii) apply dplyr::across to recode the values, iv) pipe the result into a do.call to get the max, and finally v) recode 0s to NA.
Created on 2022-01-03 by the reprex package (v0.3.0)
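
The steps described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact code behind the reprex note: the data frame name dat and the helper collapse_dummies are assumed names, columns are selected with starts_with() on the prefix plus a dot (slightly stricter than a plain "contains"), and mutate() stands in for summarise() because dplyr 1.1.0 deprecates summarise() calls that return more than one row per group. do.call() is used with pmax so that the maximum is taken row by row.

```r
library(dplyr)
library(purrr)

# Input from the question; `dat` is an assumed name.
dat <- tibble::tribble(
  ~race, ~gender, ~age.low_tm1, ~age.medium_tm1, ~age.high_tm1,
  ~chol_test.low_tm1, ~chol_test.high_tm1,
  "white", 0L, 1L, 0L, 0L, 0L, 0L,
  "white", 0L, 1L, 0L, 0L, 0L, 0L,
  "white", 1L, 1L, 0L, 0L, 0L, 0L,
  "black", 1L, 0L, 1L, 0L, 0L, 0L,
  "white", 0L, 0L, 0L, 1L, 0L, 1L,
  "black", 0L, 0L, 1L, 0L, 1L, 0L
)

# Recode the 1s of a dummy column to the value whose name occurs in the
# column name; zeros are returned unchanged. Only works inside across(),
# because it relies on dplyr::cur_column().
var_nm2value <- function(x, value_ls) {
  col_nm <- dplyr::cur_column()
  for (nm in names(value_ls)) {
    if (grepl(nm, col_nm, fixed = TRUE)) {
      return(ifelse(x == 1L, value_ls[[nm]], x))
    }
  }
  x  # no name matched: leave the column as is
}

# Name-value pairs used for recoding.
recode_ls <- list(low = 1L, medium = 2L, high = 3L)

# Hypothetical helper: collapse all dummy columns of one variable
# (e.g. "age") into a single vector.
collapse_dummies <- function(data, prefix, value_ls) {
  recoded <- data %>%
    select(starts_with(paste0(prefix, "."))) %>%
    mutate(across(everything(), function(col) var_nm2value(col, value_ls)))
  out <- do.call(pmax, recoded)  # row-wise maximum across the dummy columns
  out[out == 0L] <- NA           # all dummies zero: no category recorded
  out
}

# Variables to create; add further prefixes here for new tests.
targets <- c("age", "chol_test")

recoded <- map_dfc(
  set_names(targets),
  function(p) collapse_dummies(dat, p, recode_ls)
)

result <- bind_cols(select(dat, race, gender), recoded)
print(result)
```

For the six rows in the question this should give age = 1, 1, 1, 2, 3, 2 and chol_test = NA, NA, NA, NA, 3, 1, which matches the expected output. A new test stored in the same format only needs its prefix added to targets.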