Below is the structure of my data frame. I need to group by Id, Country and State and aggregate Vector_1 and Vector_2 element-wise. The restriction is that I have to use PySpark 2.3.

    Id  Country State   Vector_1                Vector_2
    1     US     IL    [1.0,2.0,3.0,4.0,5.0]   [5.0,5.0,5.0,5.0,5.0]
    1     US     IL    [5.0,3.0,3.0,2.0,1.0]   [5.0,5.0,5.0,5.0,5.0]
    2     US     TX    [6.0,7.0,8.0,9.0,1.0]   [1.0,1.0,1.0,1.0,1.0]

The output should look like this:

    Id  Country State   Vector_1                Vector_2
    1     US     IL    [6.0,5.0,6.0,6.0,6.0]   [10.0,10.0,10.0,10.0,10.0]
    2     US     TX    [6.0,7.0,8.0,9.0,1.0]   [1.0,1.0,1.0,1.0,1.0]
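
Just to be explicit about what "aggregate" means here, each output vector is the element-wise sum of the input vectors within a group, e.g. for Id 1's Vector_1 in plain Python:

    >>> [a + b for a, b in zip([1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 3.0, 3.0, 2.0, 1.0])]
    [6.0, 5.0, 6.0, 6.0, 6.0]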

Here is my current code, but I'm looking for a different approach using a lambda. Please advise.

    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import *
    import pandas as pd
    import numpy as np

    schema = StructType([
        StructField("Id", LongType()),          # use the type that matches your Id column
        StructField("Country", StringType()),
        StructField("State", StringType()),
        StructField("Vector_1", ArrayType(DoubleType())),
        StructField("Vector_2", ArrayType(DoubleType()))
    ])


    @pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
    def g(df):
        # the grouping keys are constant within each group
        gr0 = df['Id'].iloc[0]
        gr1 = df['Country'].iloc[0]
        gr2 = df['State'].iloc[0]

        # element-wise sum of the array columns
        a = np.sum(df.Vector_1)
        b = np.sum(df.Vector_2)

        return pd.DataFrame([[gr0, gr1, gr2, a, b]])

    df.groupby("Id", "Country", "State").apply(g).show()
