I have the following prelude code that is shared between my two scenarios:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pandas as pd
import numpy as np
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [22.0, 88.0, np.nan]})
Now, I would like to convert df into a PySpark DataFrame (sdf). When I try to "cast" "col2" implicitly into LongType via a schema during the creation of sdf, it fails:
schema = StructType([StructField("col1", LongType()), StructField("col2", LongType())])
sdf = spark.createDataFrame(df[schema.fieldNames()], schema=schema)
Error:
TypeError: field col2: LongType can not accept object 22.0 in type <class 'float'>
But if I run the following snippet, it works just fine:
schema_2 = StructType(
[StructField("col1", LongType()), StructField("col2", FloatType())]
)
sdf = spark.createDataFrame(df[schema.fieldNames()], schema=schema_2)
cast_sdf = sdf.withColumn("col2", F.col("col2").cast(LongType()))
cast_sdf.show()
with the output:
+----+----+
|col1|col2|
+----+----+
| 1| 22|
| 2| 88|
| 3| 0|
+----+----+
Transforming my comment into an answer.
This is actually how Spark works with schemas. It is not specific to a pandas DataFrame being converted into a PySpark DataFrame; you'll get the same error when using the createDataFrame method with a list of tuples:
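For instance, a minimal sketch (reusing the spark session and the schema defined above, with the same values as plain tuples) reproduces it:

# Passing a Python float where the schema declares LongType is rejected,
# exactly as with the pandas DataFrame above.
data = [(1, 22.0), (2, 88.0), (3, None)]
sdf_from_tuples = spark.createDataFrame(data, schema=schema)
# TypeError: field col2: LongType can not accept object 22.0 in type <class 'float'>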
This is also the behavior with data sources like CSV when you pass a schema (although when reading CSV with mode PERMISSIVE it does not fail, the values that do not match are loaded as null). The schema does not perform any automatic casting of types; it just tells Spark which datatype is expected for each column in the rows.
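A minimal sketch of the CSV case, assuming a hypothetical file data.csv with a header row and the same three rows:

# data.csv (hypothetical):
# col1,col2
# 1,22.0
# 2,88.0
# 3,
csv_sdf = (
    spark.read
    .option("header", "true")
    .schema(schema)  # col2 declared as LongType
    .csv("data.csv")
)
csv_sdf.show()
# In the default PERMISSIVE mode, "22.0" cannot be parsed as a long,
# so col2 is loaded as null instead of raising an error.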
So when using a schema, you have to pass data that matches the specified types, or use StringType (which does not fail) and then use explicit casting to convert your columns into the desired types.
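For example, a minimal sketch of the StringType route (the intermediate cast through DoubleType is just one way to turn strings like "22.0" into longs):

str_schema = StructType(
    [StructField("col1", StringType()), StructField("col2", StringType())]
)
data = [("1", "22.0"), ("2", "88.0"), ("3", None)]
sdf_str = spark.createDataFrame(data, schema=str_schema)
cast_sdf = (
    sdf_str
    .withColumn("col1", F.col("col1").cast(LongType()))
    # cast through DoubleType first so "22.0" parses cleanly, then truncate to long
    .withColumn("col2", F.col("col2").cast(DoubleType()).cast(LongType()))
)
cast_sdf.show()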