Error when calculating correlations in pyspark

I have a two-part question. I am trying to calculate correlations between a large list of variables and a target, in order to get the top ten variables to use in a model. Some of the variables are character/categorical, others are numerical. I give an example of my data below. The data is in PySpark (Spark 2).

  1. How do I achieve this? In other words, how do I convert the character variables to dummy (or other) encodings, then calculate the correlations, and get the list of the ten most highly correlated variables? (A rough sketch of what I have in mind follows this list.)
  2. I got stuck halfway, only considering the numeric columns for now, because I am getting this error: 'NoneType' object has no attribute 'setCallSite'.
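
For question 1, this is roughly the approach I have in mind, although I am not sure it is the right one. It is an untested sketch: data_to_analyse is my full DataFrame, target_var is the 0/1 target, and the way I split the columns and name the stages is just my guess at how this is usually done.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.stat import Correlation

# Split the feature columns by type: string vs. everything else
string_cols = [f.name for f in data_to_analyse.schema.fields
               if f.dataType.simpleString() == 'string']
numeric_cols = [c for c in data_to_analyse.columns
                if c not in string_cols and c != 'target_var']

# Index each string column, then one-hot encode the indices ("dummies")
indexers = [StringIndexer(inputCol=c, outputCol=c + '_idx') for c in string_cols]
encoders = [OneHotEncoder(inputCol=c + '_idx', outputCol=c + '_ohe') for c in string_cols]

# Put the target first so row 0 of the correlation matrix is target-vs-everything
assembler = VectorAssembler(
    inputCols=['target_var'] + numeric_cols + [c + '_ohe' for c in string_cols],
    outputCol='features')

pipeline = Pipeline(stages=indexers + encoders + [assembler])
features_df = pipeline.fit(data_to_analyse).transform(data_to_analyse)

# Pearson correlation matrix over the assembled vector
corr_matrix = Correlation.corr(features_df, 'features').head()[0].toArray()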

My Data:

   target_var | col1 | col2 | col3 | col4  | ... | col350
        1     |   Y  |   0  |   2  |   C   | ... |   X
        0     |   N  |   4  |   2  |   D   | ... |   3
        0     |   N  |   3  |   2  |   A   | ... |   U
        0     |   Y  |   3  |   5  |   A   | ... |   5
        1     |   N  |   1  |   5  |   A   | ... |   6
        1     |   Y  |   4  |   X  |   C   | ... |   5
        1     |   Y  |   2  |   X  |   D   | ... |   0
        0     |   Y  |   0  |   6  |   C   | ... |   1
        0     |   N  |   0  |   2  |   D   | ... |   4
        0     |   N  |   1  |   1  |   C   | ... |   X
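
If it helps to reproduce, a small slice of the data can be built like this (column names taken from the table above; the types are my best guess, e.g. col3 is a string because it contains 'X'):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A hand-made sample mirroring a few rows and columns of the real data
small_sample = spark.createDataFrame(
    [
        (1, 'Y', 0, '2', 'C'),
        (0, 'N', 4, '2', 'D'),
        (0, 'N', 3, '2', 'A'),
        (1, 'Y', 4, 'X', 'C'),
    ],
    ['target_var', 'col1', 'col2', 'col3', 'col4'],
)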

Code so far:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.sql.functions import col

# Target column and the feature columns to analyse (numeric ones for now)
target_column = 'target_var'
selected_columns = data_to_analyse.drop(target_column).columns

# Assemble the feature columns into a single vector column
assembler = VectorAssembler(inputCols=selected_columns, outputCol='features')
assembled_df = assembler.transform(data_to_analyse)

# Keep only the target (renamed to 'label') and the assembled features
data = assembled_df.select(col(target_column).alias('label'), 'features')

# Calculate the Pearson correlation matrix over the feature vector
correlation_matrix = Correlation.corr(data, 'features').collect()[0][0]

The Correlation.corr call is where I get the error and get stuck.
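
Once that matrix does compute, my plan for the top-ten step (still numeric columns only) was roughly the following, with the target assembled as the first entry of the vector so that row 0 of the matrix holds its correlations with every feature. Again, this is an untested sketch:

import numpy as np

# Assemble the target first, then the candidate feature columns
assembler = VectorAssembler(inputCols=[target_column] + selected_columns,
                            outputCol='features')
vec_df = assembler.transform(data_to_analyse).select('features')

# Row 0 is target-vs-everything; drop the target-vs-target entry
matrix = Correlation.corr(vec_df, 'features').head()[0].toArray()
target_corr = matrix[0, 1:]

# Rank the feature columns by absolute correlation with the target, keep 10
top_ten = sorted(zip(selected_columns, target_corr),
                 key=lambda t: abs(t[1]), reverse=True)[:10]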

I'd appreciate the help.

Thank you.
