Clear and Concise Way to apply Standardization to both Train and Test Set in R

1.4k views Asked by At

I am selecting a 90/10 Training/Test split with some data in R. After I have the Training set. I would like to standardize it. I would then like to use the same mean and standard deviation used in the training set and apply that standardization to the test set.

I would like to do this in the most base-R way possible but would be ok with a dplyr solution too. Note that I have columns that are both factors/chr and numeric. Of course I need to select the numeric ones first.

My first setup is below with a reproducible example code. I have the means and standard deviations for the appropriate numeric columns, now how can I apply the standardization back to the specific columns on the training and test data?

library(tidyverse)
rm(list = ls())
x <- data.frame("hame" =  c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
                "age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34,67),
                "income" = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000))

train_split_pct = 0.90

train_size <- ceiling(nrow(x)*train_split_pct)  # num of rows for training set
test_size <- nrow(x) - train_size               # num of rows for testing set 

set.seed(123)
ix <-  sample(1:nrow(x)) # shuffle
x_new = x[ix, ]
Train_set  = x_new[1:train_size, ]
Test_set   = x_new[(train_size+1):(train_size+test_size), ]

Train_mask <- Train_set %>% select_if(is.numeric) 
Train_means <- Train_mask %>% apply(2, mean)
Train_stddevs <- Train_mask %>% apply(2, sd)

3

There are 3 answers

0
Coldchain9 On

So after reviewing the prior answers which worked fine, I found them a bit unclear to use and not intuitive. I have achieved the desired result via a for loop. While slightly rudimentary I believe it a more clear approach. Given the use case where I don't have many columns I don't see a major issue in this solution unless there were many columns of data to go through. In that case I would need help seeking a faster solution.

Regardless, my method is as follows. I gather all column names in my Train_mask which is only the numeric columns. Next, I loop through each of the names and update the values accordingly with the standardization from their respective Train_means and Train_stddevs.

Due to the way I construct my Training and Testing sets there should be no issues with the order of my column frames and they can be used sequentially in the following fashion.

library(tidyverse)
rm(list = ls())
x <- data.frame("name" =  c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
                "age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34,67),
                "income" = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000))

train_split_pct = 0.90

train_size <- ceiling(nrow(x)*train_split_pct)  # num of rows for training set
test_size <- nrow(x) - train_size               # num of rows for testing set 

set.seed(123)
ix <-  sample(1:nrow(x)) # shuffle
x_new = x[ix, ]
Train_set  = x_new[1:train_size, ]
Test_set   = x_new[(train_size+1):(train_size+test_size), ]

Train_mask <- Train_set %>% select_if(is.numeric) 
Train_means <- data.frame(as.list(Train_mask %>% apply(2, mean)))
Train_stddevs <- data.frame(as.list(Train_mask %>% apply(2, sd)))


col_names <- names(Train_mask)
for (i in 1:ncol(Train_mask)){
  Train_set[,col_names[i]] <- (Train_set[,col_names[i]] - Train_means[,col_names[i]])/Train_stddevs[,col_names[i]]
  Test_set[,col_names[i]] <-  (Test_set[,col_names[i]] - Train_means[,col_names[i]])/Train_stddevs[,col_names[i]]
}

Train_set
Test_set

Output:

> Train_set
      name       age     income
3    Brady -3.180620  0.7745967
10    Jeff -2.972814  1.0327956
2  Roberta -3.032187 -1.2909944
8    Aaron -3.145986  0.0000000
6  Phillip -3.106404  0.5163978
9    Annie -3.136090  0.5163978
1      Bob -3.007448 -1.8073922
7  Natalie -3.071769 -0.5163978
5     Omar -3.131143  0.7745967
> Test_set
  name        age    income
4  Jen -0.9769502 0.2581989
2
akrun On

We can do this in a concise way. Get the mean, sd of the 'Train' dataset ('mean_sd'). Note that with dplyr version >= 1.0, summarise can return more than one row. So, make use of that feature to create a two row dataset - first row => mean, second row => sd

library(dplyr) # >= 1.0.0    
library(purrr)
mean_sd <- Train_set %>%
    summarise(across(where(is.numeric),  ~ c(mean(., na.rm = TRUE), 
            sd(., na.rm = TRUE))))

Then, create a function ('f1') to do the standardization.

f1 <- function(x, y) (x -y[1])/y[2]

Loop over the list of 'Train', 'Test' dataset, use map2 to loop over the corresponding columns based on the 'mean_sd' dataset, apply the f1 and assign that output to the columns. Then, with list2env, we can update the same objects in the global environment

list2env(map(lst(Train_set, Test_set), ~  {
   .x[names(mean_sd)] <- map2(select(.x, names(mean_sd)), mean_sd, f1)
         .x}), .GlobalEnv)

-output

Train_set
#   hame        age     income
#3    Brady -1.3286522  0.7745967
#10    Jeff  1.6256451  1.0327956
#2  Roberta  0.7815601 -1.2909944
#8    Aaron -0.8362693  0.0000000
#6  Phillip -0.2735460  0.5163978
#9    Annie -0.6955885  0.5163978
#1      Bob  1.1332622 -1.8073922
#7  Natalie  0.2188368 -0.5163978
#5     Omar -0.6252481  0.7745967


Test_set
# hame        age    income
#4  Jen -0.9769502 0.2581989
4
Duck On

Consider this as an option. You can use scale() function that allows you to normalize your variables. At the end you can find the code. Also, you can use mutate_if() in order to choose the numeric variables and avoid creating other dataframes. Here the code using dplyr where I have created two new dataframes with the required values:

library(tidyverse)
rm(list = ls())
x <- data.frame("hame" =  c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
                "age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34,67),
                "income" = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000))

train_split_pct = 0.90

train.size <- ceiling(nrow(x)*train_split_pct)  # num of rows for training set
test.size <- nrow(x) - train.size               # num of rows for testing set 

set.seed(123)
ix <-  sample(1:nrow(x)) # shuffle
x_new = x[ix, ]
Train.set  = x_new[1:train.size, ]
Test.set   = x_new[(train.size+1):(train.size+test.size), ]
#Normalize
Train.set2 <- Train.set %>%
  mutate_if(is.numeric, scale)
Test.set2 <- Test.set %>%
  mutate_if(is.numeric, scale)

Update: If the scale() is not working, you can try reshaping the data and joining with the computed values for mean and SD:

#Define indexes for numeric vars
index.train <- which(names(Train.set)%in% names(Train_means))
#Format means and sd to merge
Train2 <- Train.set %>% 
  mutate(id=row_number()) %>%
  pivot_longer(cols=index.train) %>%
  left_join(
    Train_means %>% t() %>%data.frame %>%
      pivot_longer(everything()) %>%
      rename(Mean=value) %>%
      left_join(Train_stddevs %>% t() %>%data.frame %>%
                  pivot_longer(everything()) %>%
                  rename(SD=value))
  ) %>%
  #Compute standard values
  mutate(SValue=(value-Mean)/SD) %>%
  select(-c(value,Mean,SD)) %>%
  pivot_wider(names_from = name,values_from=SValue) %>% select(-id)

Output:

# A tibble: 9 x 3
  hame       age income
  <fct>    <dbl>  <dbl>
1 Brady   -1.33   0.775
2 Jeff     1.63   1.03 
3 Roberta  0.782 -1.29 
4 Aaron   -0.836  0    
5 Phillip -0.274  0.516
6 Annie   -0.696  0.516
7 Bob      1.13  -1.81 
8 Natalie  0.219 -0.516
9 Omar    -0.625  0.775

And for the test set, the process is similar:

#Define indexes
index.test <- which(names(Test.set)%in% names(Train_means))
#Format means and sd 2
Test2 <- Test.set %>% 
  mutate(id=row_number()) %>%
  pivot_longer(cols=index.test) %>%
  left_join(
    Train_means %>% t() %>%data.frame %>%
      pivot_longer(everything()) %>%
      rename(Mean=value) %>%
      left_join(Train_stddevs %>% t() %>%data.frame %>%
                  pivot_longer(everything()) %>%
                  rename(SD=value))
  ) %>%
  #Compute standard values
  mutate(SValue=(value-Mean)/SD) %>%
  select(-c(value,Mean,SD)) %>%
  pivot_wider(names_from = name,values_from=SValue) %>% select(-id)

Output:

# A tibble: 1 x 3
  hame     age income
  <fct>  <dbl>  <dbl>
1 Jen   -0.977  0.258

The key is merging the values after reshaping. As evidence I will show the intermediate step for the final dataset. It looks like this:

# A tibble: 2 x 7
  hame     id name   value    Mean      SD SValue
  <fct> <int> <chr>  <dbl>   <dbl>   <dbl>  <dbl>
1 Jen       1 age       30    43.9    14.2 -0.977
2 Jen       1 income 90000 85000   19365.   0.258

In that way is easy to compute the standard values you want.