I have the following prelude code that is shared between my two scenarios:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pandas as pd
import numpy as np
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [22.0, 88.0, np.nan]})
Now, I would like to convert df into a PySpark DataFrame (sdf). When I try to "cast" "col2" implicitly into LongType via a schema during the creation of sdf, it fails:
schema = StructType([StructField("col1", LongType()), StructField("col2", LongType())])
sdf = spark.createDataFrame(df[schema.fieldNames()], schema=schema)
Error:
TypeError: field col2: LongType can not accept object 22.0 in type <class 'float'>
But if I run the following snippet, it works just fine:
schema_2 = StructType(
[StructField("col1", LongType()), StructField("col2", FloatType())]
)
sdf = spark.createDataFrame(df[schema.fieldNames()], schema=schema_2)
cast_sdf = sdf.withColumn("col2", F.col("col2").cast(LongType()))
cast_sdf.show()
with the output:
+----+----+
|col1|col2|
+----+----+
| 1| 22|
| 2| 88|
| 3| 0|
+----+----+
Transforming my comment into an answer.
This is actually how Spark works with schemas. It is not specific to a pandas DataFrame being converted into a PySpark DataFrame; you'll get the same error when using the createDataFrame method with a list of tuples:
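For instance, a minimal sketch (reusing the spark session and the schema defined above, with the same values as plain tuples) reproduces it:

# Passing a Python float where the schema declares LongType is rejected,
# exactly as with the pandas DataFrame above.
data = [(1, 22.0), (2, 88.0), (3, None)]
sdf_from_tuples = spark.createDataFrame(data, schema=schema)
# TypeError: field col2: LongType can not accept object 22.0 in type <class 'float'>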
This is also the behavior with data sources like CSV when you pass a schema (although when reading CSV with mode PERMISSIVE it does not fail, the values that do not match are loaded as null). The schema does not perform any automatic casting of types; it just tells Spark which datatype is expected for each column in the rows.
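A minimal sketch of the CSV case, assuming a hypothetical file data.csv with a header row and the same three rows:

# data.csv (hypothetical):
# col1,col2
# 1,22.0
# 2,88.0
# 3,
csv_sdf = (
    spark.read
    .option("header", "true")
    .schema(schema)  # col2 declared as LongType
    .csv("data.csv")
)
csv_sdf.show()
# In the default PERMISSIVE mode, "22.0" cannot be parsed as a long,
# so col2 is loaded as null instead of raising an error.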
So when using a schema, you have to pass data that matches the specified types, or use StringType (which does not fail) and then use explicit casting to convert your columns into the desired types.
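For example, a minimal sketch of the StringType route (the intermediate cast through DoubleType is just one way to turn strings like "22.0" into longs):

str_schema = StructType(
    [StructField("col1", StringType()), StructField("col2", StringType())]
)
data = [("1", "22.0"), ("2", "88.0"), ("3", None)]
sdf_str = spark.createDataFrame(data, schema=str_schema)
cast_sdf = (
    sdf_str
    .withColumn("col1", F.col("col1").cast(LongType()))
    # cast through DoubleType first so "22.0" parses cleanly, then truncate to long
    .withColumn("col2", F.col("col2").cast(DoubleType()).cast(LongType()))
)
cast_sdf.show()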