I have a two-part question. I am trying to calculate the correlation between a target variable and a large list of variables, so that I can pick the top ten variables to use in a model. Some of the variables are character/categorical and others are numerical. An example of my data is below. I am using PySpark on Spark 2.
- How do I achieve this? In other words, how do I convert the character variables to dummy variables (or some other encoding) and then calculate the correlations, to get a list of the ten most highly correlated variables?
- I got stuck halfway, considering only the numeric variables for now, because I am getting this error:
'NoneType' object has no attribute 'setCallSite'
My Data:
target_var | col1 | col2 | col3 | col4 | ... | col350
1 | Y | 0 | 2 | C | ... | X
0 | N | 4 | 2 | D | ... | 3
0 | N | 3 | 2 | A | ... | U
0 | Y | 3 | 5 | A | ... | 5
1 | N | 1 | 5 | A | ... | 6
1 | Y | 4 | X | C | ... | 5
1 | Y | 2 | X | D | ... | 0
0 | Y | 0 | 6 | C | ... | 1
0 | N | 0 | 2 | D | ... | 4
0 | N | 1 | 1 | C | ... | X
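Since the columns are a mix of types, a first step could be to split them into numeric and string columns. The following is only a sketch: `split_by_type` and `example_dtypes` are hypothetical names, and the type strings assume the `(name, type)` pairs that `DataFrame.dtypes` returns. It also assumes that columns which mix digits with values like X (such as col3 above) are loaded as strings.

```python
# Spark SQL type names treated as numeric in this sketch
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def split_by_type(dtypes, target):
    """Split (name, type) pairs into numeric and string column names,
    leaving out the target column."""
    numeric, categorical = [], []
    for name, dtype in dtypes:
        if name == target:
            continue
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal"):
            numeric.append(name)
        elif dtype == "string":
            categorical.append(name)
    return numeric, categorical


# Hypothetical dtypes mirroring the sample above; with Spark this list
# would come from data_to_analyse.dtypes instead.
example_dtypes = [
    ("target_var", "int"),
    ("col1", "string"),
    ("col2", "int"),
    ("col3", "string"),
    ("col4", "string"),
]
numeric_cols, categorical_cols = split_by_type(example_dtypes, "target_var")
```

Columns of any other type (dates, booleans, and so on) are simply left out of both lists here.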
Code so far:
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Name of the target column (defined before it is used below)
target_column = 'target_var'
selected_columns = data_to_analyse.drop(target_column).columns
# Assemble the columns into a feature vector
assembler = VectorAssembler(inputCols=selected_columns, outputCol='features')
assembled_df = assembler.transform(data_to_analyse)
# Select the target column and the feature vector
data = assembled_df.select(col(target_column).alias('label'), 'features')
# Calculate the correlation matrix of the features
correlation_matrix = Correlation.corr(data, 'features').collect()[0][0]
This is where I get the error and get stuck.
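For the first part of the question, one possible approach is sketched here. It is not a tested solution, and it does not address the error above. The helper names `add_dummies`, `correlations_with_target` and `top_k_by_abs_corr` are hypothetical. The sketch assumes Spark 2.2 or later (for `Correlation.corr`), places the target first in the assembled vector so that the first row of the matrix holds its correlation with every feature, and ranks by absolute Pearson correlation.

```python
def add_dummies(df, column, categories):
    """Add one 0/1 column per category value of a string column."""
    from pyspark.sql import functions as F  # lazy import: only needed with Spark
    for value in categories:
        dummy_name = "{}_{}".format(column, value)
        df = df.withColumn(dummy_name, (F.col(column) == value).cast("double"))
    return df


def correlations_with_target(df, target, feature_cols):
    """Pearson correlation of each numeric feature column with the target."""
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation
    # Target goes first, so row 0 of the matrix is target vs. everything
    cols = [target] + list(feature_cols)
    assembled = VectorAssembler(inputCols=cols, outputCol="features").transform(df)
    matrix = Correlation.corr(assembled, "features").head()[0].toArray()
    return [float(x) for x in matrix[0][1:]]  # skip the target's self-correlation


def top_k_by_abs_corr(names, corrs, k=10):
    """Rank feature names by |correlation|, dropping NaN entries."""
    pairs = [(n, c) for n, c in zip(names, corrs) if c == c]  # NaN != NaN
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)[:k]


# Made-up correlation values, only to show the ranking step without Spark
ranked = top_k_by_abs_corr(["col2", "col3", "col4"], [0.12, -0.80, float("nan")], k=2)
# ranked == [("col3", -0.8), ("col2", 0.12)]

# With Spark (not run here), where feature_cols is a list of numeric column names:
#   corrs = correlations_with_target(data_to_analyse, "target_var", feature_cols)
#   top_ten = top_k_by_abs_corr(feature_cols, corrs, k=10)
```

The category values passed to `add_dummies` could be collected with something like `data_to_analyse.select(column).distinct().collect()`, and the resulting dummy columns would then be added to the feature list before computing the correlations.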
I'd appreciate the help.
Thank you.