I am new to PySpark. I am using pyspark.pandas and want to test how it can be used with the scipy library. I have a really basic piece of code using scipy and pandas:
import pandas as pd
from scipy.stats import pearsonr
# Example data
data = {
    'Age': [23, 45, 34, 65, 34, 29, 40],
    'Day_of_Week': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
}
df = pd.DataFrame(data)
# Assigning numerical values for the days of the week
days_of_week = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4, 'Friday': 5, 'Saturday': 6, 'Sunday': 7}
df['Day_of_Week_num'] = df['Day_of_Week'].map(days_of_week)
# Calculating Pearson correlation
corr, _ = pearsonr(df['Age'], df['Day_of_Week_num'])
print('Pearson correlation coefficient:', corr)
When I changed "pandas" to "pyspark.pandas", I received the following error:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
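For reference, this is the pyspark.pandas version that triggers it. A minimal sketch, assuming an active Spark session and reusing the data and days_of_week dictionaries from above:

import pyspark.pandas as ps
from scipy.stats import pearsonr

# Same example data, but as a pyspark.pandas DataFrame
psdf = ps.DataFrame(data)
psdf['Day_of_Week_num'] = psdf['Day_of_Week'].map(days_of_week)

# pearsonr iterates over its inputs, which pyspark.pandas forbids
corr, _ = pearsonr(psdf['Age'], psdf['Day_of_Week_num'])  # raises PandasNotImplementedError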
I encountered an issue after changing "pandas" to "pyspark.pandas". I came across a similar problem in this PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array. However, the person applied the function over the entire dataframe, and not on the pyspark.pandas.Series. As far as I know, pd.Series.__iter__()
is not implemented because it requires collecting the data to a single node, (such as iterating over a Series) . Therefore, I'm not sure if it is possible to implement scipy.stats.pearsonr on pyspark.pandas.Series.
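The only workaround I have come up with is to collect the two columns to the driver first, which of course defeats the purpose of Spark for large data. A minimal sketch, assuming both columns fit in driver memory (to_numpy() collects the Series to the driver):

# Collect both columns as NumPy arrays on the driver, then call scipy as usual
corr, _ = pearsonr(psdf['Age'].to_numpy(), psdf['Day_of_Week_num'].to_numpy())
print('Pearson correlation coefficient:', corr)

Alternatively, pyspark.pandas seems to offer its own Series.corr, e.g. psdf['Age'].corr(psdf['Day_of_Week_num'], method='pearson'), which, as far as I can tell, computes the coefficient without collecting, but unlike scipy's pearsonr it does not return a p-value. Is collecting really the only way to use scipy here?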