I have a large dataset with questionnaire data at multiple time points (waves). The questionnaire was identical at each point, so variables are labeled by time in the form "w#variablename" (e.g., "w1age", "w2age", "w3age").

I split the larger file into data frames by each time point, so I would now like to remove the "w#" from the column name for each column.

Basically, I would like to use R to "find and replace" the "w1" prefix with nothing in every column name that has it (I want to keep the columns, just rename them).

I split the data as follows:

w1 = Data %>% select(matches("w1"))
w2 = Data %>% select(matches("w2"))
w3 = Data %>% select(matches("w3"))
w4 = Data %>% select(matches("w4"))

Now for each of these 4 data sets, I would like to remove the respective "w#" from column names.

Thank you!

2 Answers

Tim Biegeleisen On

We should be able to use sub here:

names(Data) <- sub("^w\\d+", "", names(Data))

The regex pattern ^w\\d+ matches, at the start of each column name, the letter w followed by one or more digits. We then replace the match with an empty string, which removes the prefix from every matching column name.
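Since the question wants the prefix removed from each per-wave data frame rather than from the full Data, here is a minimal sketch of the same sub call applied to every split frame. The toy Data values, the waves list, and the strip_wave helper are made up for illustration and are not part of the original question or answer.

```r
# Toy stand-in for the question's Data (hypothetical values)
Data <- data.frame(w1age = c(30, 41), w1score = c(5, 7),
                   w2age = c(31, 42), w2score = c(6, 8))

# Split by wave; the patterns are anchored with ^ so they only match a prefix
waves <- list(
  w1 = Data[grepl("^w1", names(Data))],
  w2 = Data[grepl("^w2", names(Data))]
)

# Hypothetical helper: drop a leading "w<digits>" from every column name
strip_wave <- function(df) {
  names(df) <- sub("^w\\d+", "", names(df))
  df
}

waves <- lapply(waves, strip_wave)

names(waves$w1)  # "age" "score"
names(waves$w2)  # "age" "score"
```

Keeping the frames in a list means one lapply call handles every wave, instead of repeating the same line for w1 through w4.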

akrun On

An option with the tidyverse is rename_at. Use matches to select only the column names that need to change, and str_remove to strip the substring "w" followed by one or more digits:

library(dplyr)    # %>%, rename_at, vars, matches
library(stringr)  # str_remove
Data %>% 
   rename_at(vars(matches("^w\\d+")), ~ str_remove(., "^w\\d+"))
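On dplyr 1.0.0 or later, rename_at is superseded by rename_with, so the same idea can probably be written as below. This is a sketch on a single per-wave frame; the toy w1 values and the w1_clean name are invented for illustration.

```r
library(dplyr)
library(stringr)

# Toy stand-in for one wave's data frame (hypothetical values)
w1 <- data.frame(w1age = c(30, 41), w1score = c(5, 7), id = 1:2)

# Rename only the columns that start with "w<digits>", leaving "id" untouched
w1_clean <- w1 %>%
  rename_with(~ str_remove(.x, "^w\\d+"), .cols = matches("^w\\d+"))

names(w1_clean)  # "age" "score" "id"
```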

NOTE: If the data set has columns such as w1age, w2age, ..., w100age, removing the "w" followed by digits leaves all of them with the same column name, which is discouraged. In that case we may need to wrap the new names in make.unique to keep the column names unique.
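To illustrate the note, here is a small base R sketch of what happens when the prefix is stripped from an unsplit data frame, and how make.unique changes the result. The one-row Data and the stripped variable are made up for the example.

```r
# Toy unsplit data: the same variable measured at three waves (hypothetical values)
Data <- data.frame(w1age = 30, w2age = 31, w3age = 32)

# Stripping the prefix collapses all three names to "age"
stripped <- sub("^w\\d+", "", names(Data))
stripped  # "age" "age" "age"

# make.unique appends .1, .2, ... to the repeated names
names(Data) <- make.unique(stripped)
names(Data)  # "age" "age.1" "age.2"
```

The suffixes no longer say which wave a column came from, so splitting by wave first and renaming each frame separately may be the cleaner route.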