Create a Model for Dummy Variables

949 views Asked by At

Starting with a training data set for a variable var1 as:

var1
A
B
C
D

I want to create a model (let's call it dummy_model1) that would then transform the training data set to:

var1_A  var1_B  var1_C  var1_D
1       0       0       0
0       1       0       0
0       0       1       0
0       0       0       1

This functionality (or similar) exists in, among others, the dummies package in R and get_dummies in Pandas, or even case statements in SQL.

I'd like to then be able to apply dummy_model1 to a new data set:

var1
C
7
#
A

and get the following output:

var1_A  var1_B  var1_C  var1_D
0       0       1       0
0       0       0       0
0       0       0       0
1       0       0       0

I know I can do this in SQL with 'case' statements but would love to automate the process given I have ~2,000 variables. Also, the new data sets will almost always have "bad" data (e.g., 7 and # in the above example).

Somewhat language agnostic (as long as its open source) but would prefer Python or R. Please note the data is over 500GB so that limits some of my options. Thanks in advance.

2

There are 2 answers

2
Nick Becker On BEST ANSWER

Assuming var1 fits in memory on its own, here is a possible solution:

First, read in var1.

Next, use get_dummies to get all the "training" categories encoded as dummy variables. Store the column names as a list or an array.

Then, read in the first few rows of your training dataset to get the column names and store them as a list (or if you know these already you can skip this step).

Create a new list or array containing the dummy variable column names and the relevant other columns (this could just be every column in the dataset except var1). This will be the final columns encoding.

Then, read in your test data. Use get_dummies to encode var1 in your test data, knowing it may be missing categories or have extraneous categories. Then reindex the data to match the final columns encoding.

After reindexing, you will end up a test dataset with var1 dummies consistent with your training var1.

To illustrate:

import pandas as pd
import numpy as np

training = pd.DataFrame({
        'var1': ['a','b','c'],
        'other_var':[4,7,3],
        'yet_another':[8,0,2]
    })

print training
   other_var var1  yet_another
0          4    a            8
1          7    b            0
2          3    c            2

test = pd.DataFrame({
        'var1': ['a','b','q'],
        'other_var':[9,4,2],
        'yet_another':[9,1,5]
    })

print test
   other_var var1  yet_another
0          9    a            9
1          4    b            1
2          2    q            5


var1_dummied = pd.get_dummies(training.var1, prefix='var1')
var_dummy_columns =  var1_dummied.columns.values

print var_dummy_columns
array(['var1_a', 'var1_b', 'var1_c'], dtype=object)

final_encoding_columns = np.append(training.drop(['var1'], axis = 1).columns, var_dummy_columns)

print final_encoding_columns
array(['other_var', 'yet_another', 'var1_a', 'var1_b', 'var1_c'], dtype=object)


test_encoded = pd.get_dummies(test, columns=['var1'])

print test_encoded
   other_var  yet_another  var1_a  var1_b  var1_q
0          9            9       1       0       0
1          4            1       0       1       0
2          2            5       0       0       1

test_encoded_reindexed = test_encoded.reindex(columns = final_encoding_columns, fill_value=0)

print test_encoded_reindexed
   other_var  yet_another  var1_a  var1_b  var1_c
0          9            9       1       0       0
1          4            1       0       1       0
2          2            5       0       0       0

This should be what you want, based on the expected output in your question and the comments.

If the test data easily fits in memory, you can easily extend this to multiple variables. Just save and then update final_encoding_columns iteratively for each training variable you want to encode. Then pass all of those columns to the columns= argument when reindexing the test data. Reindex with your complete final_encoding_columns and you should be all set.

2
joel.wilson On

just a try:

# first set the variable to factor with levels specified
df$var1 <- factor(df$var1, levels = LETTERS[1:4])

model.matrix(data = df, ~var1-1)
#  var1A var1B var1C var1D
#1     0     0     1     0
#4     1     0     0     0

# or even 
sapply(LETTERS[1:4], function(x) as.numeric(x==df$var1))
#     A B C D
#[1,] 0 0 1 0
#[2,] 0 0 0 0
#[3,] 0 0 0 0
#[4,] 1 0 0 0