Data table dcast column headings

470 views Asked by At

I have a data table of the form

ID  REGION  INCOME_BAND RESIDENCY_YEARS
1   SW  Under 5,000 10-15
2   Wales   Over 70,000 1-5
3   Center  15,000-19,999   6-9
4   SE  15,000-19,999   15-19
5   North   15,000-19,999   10-15
6   North   15,000-19,999   6-9

created by

exp = data.table(
  ID = c(1,2,3,4,5,6),
  REGION=c("SW", "Wales", "Center", "SE", "North", "North"),
  INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999","?15,000-?19,999"),
  RESIDENCY_YEARS = c("10-15","1-5","6-9","15-19","10-15", "6-9"))

I would like to transform this to

Example of the result of any data table manipulation

I've managed to perform the majority of the work with dcast:

exp.dcast = dcast(exp,ID~REGION+INCOME_BAND+RESIDENCY_YEARS, fun=length,
  value.var=c('REGION', 'INCOME_BAND', 'RESIDENCY_YEARS'))

However I need some help creating sensible column headings. Currently I have

["ID"
"REGION.1_Center_?15,000-?19,999_6-9"
"REGION.1_North_?15,000-?19,999_10-15"
"REGION.1_North_?15,000-?19,999_6-9"
"REGION.1_SE_?15,000-?19,999_15-19" "REGION.1_SW_Under ?5,000_10-15" "REGION.1_Wales_Over ?70,000_1-5"
"INCOME_BAND.1_Center_?15,000-?19,999_6-9"
"INCOME_BAND.1_North_?15,000-?19,999_10-15"
"INCOME_BAND.1_North_?15,000-?19,999_6-9"
"INCOME_BAND.1_SE_?15,000-?19,999_15-19"
"INCOME_BAND.1_SW_Under ?5,000_10-15"
"INCOME_BAND.1_Wales_Over ?70,000_1-5"
"RESIDENCY_YEARS.1_Center_?15,000-?19,999_6-9" "RESIDENCY_YEARS.1_North_?15,000-?19,999_10-15" "RESIDENCY_YEARS.1_North_?15,000-?19,999_6-9"
"RESIDENCY_YEARS.1_SE_?15,000-?19,999_15-19"
"RESIDENCY_YEARS.1_SW_Under ?5,000_10-15"
"RESIDENCY_YEARS.1_Wales_Over ?70,000_1-5"

And I would like the column headings to be

ID  SW  Wales   Center  SE  North   Under 5,000 Over 70,000 15,000-19,999   1-5 6-9 10-15   15-19

Could anybody advise?

1

There are 1 answers

0
Uwe On

This apparently simple question is not easy to answer. So, we will go forward step-by step.

First, the OP has tried to reshape multiple value columns simultaneously which creates an unwanted cross product of all available combinations.

In order to treat all values in the same way, we need to melt() all value columns first before reshaping:

melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)]
   ID 1-5 10-15 15-19 6-9 ?15,000-?19,999 Center North Over ?70,000 SE SW Under ?5,000 Wales
1:  1   0     1     0   0               0      0     0            0  0  1            1     0
2:  2   1     0     0   0               0      0     0            1  0  0            0     1
3:  3   0     0     0   1               1      1     0            0  0  0            0     0
4:  4   0     0     1   0               1      0     0            0  1  0            0     0
5:  5   0     1     0   0               1      0     1            0  0  0            0     0
6:  6   0     0     0   1               1      0     1            0  0  0            0     0

Now, the result has 13 columns instead of 19 and the columns are named by the respective value as requested.

Unfortunately, the columns appear in the wrong order because they alphabetically ordered. There are two approaches to change the order:

Change order of columns after reshaping

The setcolorder() function reorders the columns of a data.table in place, e.g. without copying:

# define column order = order of values
col_order <- c("North", "Wales", "Center", "SW", "SE", "Under ?5,000", "?15,000-?19,999", "Over ?70,000", "1-5", "6-9", "10-15", "15-19")

melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)][
  # reorder columns
  , setcolorder(.SD, c("ID", col_order))]
   ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1:  1     0     0      0  1  0            1               0            0   0   0     1     0
2:  2     0     1      0  0  0            0               0            1   1   0     0     0
3:  3     0     0      1  0  0            0               1            0   0   1     0     0
4:  4     0     0      0  0  1            0               1            0   0   0     0     1
5:  5     1     0      0  0  0            0               1            0   0   0     1     0
6:  6     1     0      0  0  0            0               1            0   0   1     0     0

Now, all REGION columns appear first, followed by INCOME_BAND and RESIDENCY_YEARS columns in the specified order.

Set factor levels before reshaping

If value is turned into a factor with appropriately ordered factor levels dcast() will use the factor levels for ordering the columns:

melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
  , dcast(.SD, ID ~ value, length)]
   ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1:  1     0     0      0  1  0            1               0            0   0   0     1     0
2:  2     0     1      0  0  0            0               0            1   1   0     0     0
3:  3     0     0      1  0  0            0               1            0   0   1     0     0
4:  4     0     0      0  0  1            0               1            0   0   0     0     1
5:  5     1     0      0  0  0            0               1            0   0   0     1     0
6:  6     1     0      0  0  0            0               1            0   0   1     0     0

Set factor levels before reshaping - lazy version

If it is sufficient to have the columns grouped by REGION, INCOME_BAND, and RESIDENCY_YEARS then we can use a short cut to avoid specifying each value in col_order. The fct_inorder() function from the forcats package reorders factor levels by their first appearance in a vector:

melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
  , dcast(.SD, ID ~ value, length)]
   ID SW Wales Center SE North Under ?5,000 Over ?70,000 ?15,000-?19,999 10-15 1-5 6-9 15-19
1:  1  1     0      0  0     0            1            0               0     1   0   0     0
2:  2  0     1      0  0     0            0            1               0     0   1   0     0
3:  3  0     0      1  0     0            0            0               1     0   0   1     0
4:  4  0     0      0  1     0            0            0               1     0   0   0     1
5:  5  0     0      0  0     1            0            0               1     1   0   0     0
6:  6  0     0      0  0     1            0            0               1     0   0   1     0

This works because the output of melt() is ordered by variable:

melt(exp, id.vars = "ID")
    ID        variable           value
 1:  1          REGION              SW
 2:  2          REGION           Wales
 3:  3          REGION          Center
 4:  4          REGION              SE
 5:  5          REGION           North
 6:  6          REGION           North
 7:  1     INCOME_BAND    Under ?5,000
 8:  2     INCOME_BAND    Over ?70,000
 9:  3     INCOME_BAND ?15,000-?19,999
10:  4     INCOME_BAND ?15,000-?19,999
11:  5     INCOME_BAND ?15,000-?19,999
12:  6     INCOME_BAND ?15,000-?19,999
13:  1 RESIDENCY_YEARS           10-15
14:  2 RESIDENCY_YEARS             1-5
15:  3 RESIDENCY_YEARS             6-9
16:  4 RESIDENCY_YEARS           15-19
17:  5 RESIDENCY_YEARS           10-15
18:  6 RESIDENCY_YEARS             6-9