data_2['col1'] = np.where((df1.year.astype(int) == 2021) & (df1.col1_y.notna()), df1.col1_y, data_2.col1)
This is my original code, which works in Gen1, but it gives the following error in Gen2.
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
I tried adding `.to_numpy()`, but that produced a different error:
data_2['col1'] = np.where((df1.year.to_numpy().astype(int) == 2021) & (df1.col1_y.notna()), df1.col1_y, data_2.col1)
AttributeError: 'numpy.ndarray' object has no attribute '_internal'
I could not understand why it is looking for `_internal`. Could someone help me resolve this error?
To be clear about the order: I ran the first snippet first, then the `.to_numpy()` variant. The value in column `col1` should be set according to the condition in the code, but instead each attempt raises one of the errors above.
Convert your pyspark.pandas DataFrames to plain pandas DataFrames (with `to_pandas()`) and run your code on those; it will then work. `np.where` needs to iterate over its inputs, which pyspark.pandas Series do not support, hence the `__iter__` error. Note that `to_pandas()` collects all the data onto the driver, so this is only practical if the DataFrames fit in memory.
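As a minimal sketch of that approach: the plain pandas frames below stand in for the result of calling `to_pandas()` on `df1` and `data_2` (the column values are illustrative, not from the original question), and the `np.where` line is the questioner's original logic, which works unchanged on pandas objects.

```python
import numpy as np
import pandas as pd

# Stand-ins for df1.to_pandas() and data_2.to_pandas().
# In the real code you would write:
#   df1 = df1.to_pandas()
#   data_2 = data_2.to_pandas()
df1 = pd.DataFrame({
    "year": ["2020", "2021", "2021"],
    "col1_y": [10.0, 20.0, np.nan],
})
data_2 = pd.DataFrame({"col1": [1.0, 2.0, 3.0]})

# Original condition: take col1_y where year == 2021 and col1_y is
# not null; otherwise keep the existing data_2.col1 value.
data_2["col1"] = np.where(
    (df1.year.astype(int) == 2021) & (df1.col1_y.notna()),
    df1.col1_y,
    data_2.col1,
)

print(data_2["col1"].tolist())  # [1.0, 20.0, 3.0]
```

If the result needs to go back to Spark afterwards, `pyspark.pandas.from_pandas(data_2)` converts it back, though for large data a native-Spark rewrite of the condition would avoid the round trip.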