I am selecting a 90/10 training/test split of some data in R. Once I have the training set, I would like to standardize it. I would then like to take the mean and standard deviation computed on the training set and apply that same standardization to the test set.
I would like to do this in the most base-R way possible, but would be OK with a dplyr solution too. Note that I have both factor/chr columns and numeric columns, so of course I need to select the numeric ones first.
My setup so far is below, as a reproducible example. I have the means and standard deviations for the appropriate numeric columns; now how can I apply the standardization back to those specific columns in the training and test data?
library(tidyverse)
rm(list = ls())
x <- data.frame("hame" = c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
                "age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34, 67),
                "income" = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000))
train_split_pct = 0.90
train_size <- ceiling(nrow(x)*train_split_pct) # num of rows for training set
test_size <- nrow(x) - train_size # num of rows for testing set
set.seed(123)
ix <- sample(1:nrow(x)) # shuffle
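x_new <- x[ix, ]
Train_set <- x_new[1:train_size, ]
Test_set <- x_new[(train_size+1):(train_size+test_size), ]
# keep only the numeric columns of the training set
Train_mask <- Train_set %>% select_if(is.numeric)
# per-column mean and standard deviation, computed on the training set only
Train_means <- Train_mask %>% apply(2, mean)
Train_stddevs <- Train_mask %>% apply(2, sd)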
After reviewing the prior answers, which worked fine, I found them a bit unclear and not very intuitive to use. I achieved the desired result with a for loop instead. While slightly rudimentary, I believe it is a clearer approach. Since I don't have many columns, I don't see a major issue with this solution; if there were many columns of data to go through, I would need help finding a faster one.
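For what it's worth, if the column count did grow, a vectorized version could be sketched roughly as below. This is only an illustration under a few assumptions: it rebuilds the example data in base R so it runs on its own, it uses sapply(..., is.numeric) in place of select_if(), and it relies on scale() accepting numeric center and scale vectors that are matched to the columns by position. The helper name num_cols is my own.

```r
# Rebuild the example data and the 90/10 split (base R only).
x <- data.frame("hame" = c("Bob", "Roberta", "Brady", "Jen", "Omar",
                           "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
                "age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34, 67),
                "income" = c(50000, 60000, 100000, 90000, 100000,
                             95000, 75000, 85000, 95000, 105000))
train_size <- ceiling(nrow(x) * 0.90)
set.seed(123)
x_new <- x[sample(1:nrow(x)), ]
Train_set <- x_new[seq_len(train_size), ]
Test_set <- x_new[-seq_len(train_size), ]

# Names of the numeric columns (num_cols is a hypothetical helper name).
num_cols <- names(Train_set)[sapply(Train_set, is.numeric)]

# Training-set statistics, in the same column order as num_cols.
Train_means <- colMeans(Train_set[num_cols])
Train_stddevs <- sapply(Train_set[num_cols], sd)

# scale() subtracts `center` and divides by `scale`, column by column,
# so both sets are standardized with the training statistics.
Train_set[num_cols] <- scale(Train_set[num_cols],
                             center = Train_means, scale = Train_stddevs)
Test_set[num_cols] <- scale(Test_set[num_cols],
                            center = Train_means, scale = Train_stddevs)
```

Because scale() matches center and scale to columns by position, the statistics have to be computed over the same num_cols, in the same order, as the columns being scaled.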
Regardless, my method is as follows. I gather all the column names in my Train_mask, which contains only the numeric columns. Next, I loop through each of the names and update the values with the standardization from their respective Train_means and Train_stddevs. Because of the way I construct my training and test sets, there should be no issues with the order of the columns in the data frames, so they can be used sequentially in the following fashion.
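A minimal sketch of such a loop, rebuilt in base R so that it runs on its own, might look like the following. A few things here are assumptions rather than part of the setup above: sapply(..., is.numeric) stands in for select_if(), num_cols and col are illustrative names, and the statistics are looked up by column name rather than by position, which avoids depending on column order at all.

```r
# Rebuild the example data and the 90/10 split (base R only).
x <- data.frame("hame" = c("Bob", "Roberta", "Brady", "Jen", "Omar",
                           "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
                "age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34, 67),
                "income" = c(50000, 60000, 100000, 90000, 100000,
                             95000, 75000, 85000, 95000, 105000))
train_size <- ceiling(nrow(x) * 0.90)
set.seed(123)
x_new <- x[sample(1:nrow(x)), ]
Train_set <- x_new[seq_len(train_size), ]
Test_set <- x_new[-seq_len(train_size), ]

# Numeric columns only, with their training-set mean and sd.
num_cols <- names(Train_set)[sapply(Train_set, is.numeric)]
Train_means <- sapply(Train_set[num_cols], mean)
Train_stddevs <- sapply(Train_set[num_cols], sd)

# Standardize every numeric column in both sets using the
# statistics computed on the training set only.
for (col in num_cols) {
  Train_set[[col]] <- (Train_set[[col]] - Train_means[[col]]) / Train_stddevs[[col]]
  Test_set[[col]] <- (Test_set[[col]] - Train_means[[col]]) / Train_stddevs[[col]]
}

print(Train_set)
print(Test_set)
```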
Output: