Starting with a training data set for a variable var1
as:
var1
A
B
C
D
I want to create a model (let's call it dummy_model1
) that would then transform the training data set to:
var1_A var1_B var1_C var1_D
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
This functionality (or similar) exists in, among others, the dummies
package in R
and get_dummies
in Pandas
, or even case
statements in SQL
.
I'd like to then be able to apply dummy_model1
to a new data set:
var1
C
7
#
A
and get the following output:
var1_A var1_B var1_C var1_D
0 0 1 0
0 0 0 0
0 0 0 0
1 0 0 0
I know I can do this in SQL
with 'case' statements but would love to automate the process given I have ~2,000 variables. Also, the new data sets will almost always have "bad" data (e.g., 7
and #
in the above example).
Somewhat language agnostic (as long as its open source) but would prefer Python
or R
. Please note the data is over 500GB so that limits some of my options. Thanks in advance.
Assuming
var1
fits in memory on its own, here is a possible solution:First, read in
var1
.Next, use
get_dummies
to get all the "training" categories encoded as dummy variables. Store the column names as a list or an array.Then, read in the first few rows of your training dataset to get the column names and store them as a list (or if you know these already you can skip this step).
Create a new list or array containing the dummy variable column names and the relevant other columns (this could just be every column in the dataset except
var1
). This will be the final columns encoding.Then, read in your test data. Use
get_dummies
to encodevar1
in your test data, knowing it may be missing categories or have extraneous categories. Then reindex the data to match the final columns encoding.After reindexing, you will end up a test dataset with
var1
dummies consistent with your trainingvar1
.To illustrate:
This should be what you want, based on the expected output in your question and the comments.
If the test data easily fits in memory, you can easily extend this to multiple variables. Just save and then update
final_encoding_columns
iteratively for each training variable you want to encode. Then pass all of those columns to thecolumns=
argument when reindexing the test data. Reindex with your completefinal_encoding_columns
and you should be all set.