count() function fails when reading data from Cassandra into pyspark dataframe


I am reading data from Cassandra as:

df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(**configs)\
    .options(table=tablename, keyspace=keyspace)\
    .option("ssl", True)\
    .option("sslmode", "require")\
    .load()

Now this df is a pyspark DataFrame. I am able to call show() and printSchema() on it, but when I run

df.count()

it throws this error:

An error was encountered:
An error occurred while calling o1394.count.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 19 in stage 48.0 failed 4 times, most recent failure: Lost task 19.3 in stage 48.0
(TID 2053, js-56258-63801-i-32-w-1.net, executor 9): java.lang.IllegalArgumentException:
requirement failed: Column not found in Java driver Row: count

How can I resolve this issue? Thanks in advance.


1 Answer

Answered by stevenlacerda:

I'm assuming it's not failing at the same stage every time. If that's the case, then you can try tuning the read/write parameters:

https://github.com/datastax/spark-cassandra-connector/blob/b2.4/doc/reference.md#read-tuning-parameters

https://github.com/datastax/spark-cassandra-connector/blob/b2.4/doc/reference.md#write-tuning-parameters

When you start pyspark, you'll need to pass these in as --conf spark.cassandra.<option> arguments.
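For example, here is a minimal sketch of setting a couple of read-tuning options, either on the pyspark command line or via the SparkSession builder. The option names follow the b2.4 reference page linked above (newer connector versions use different spellings), and the values, keyspace, and table names are illustrative placeholders, not recommendations:

    # A minimal sketch, assuming the spark-cassandra-connector 2.4 option names.
    # Values below are examples only -- tune them for your cluster.
    #
    # Equivalent command-line form when launching pyspark:
    #   pyspark --conf spark.cassandra.input.split.size_in_mb=32 \
    #           --conf spark.cassandra.input.fetch.size_in_rows=500
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cassandra-count")
        # Smaller splits -> more, smaller Spark partitions per Cassandra token range
        .config("spark.cassandra.input.split.size_in_mb", "32")
        # Fewer rows fetched per round trip, reducing the load of each read
        .config("spark.cassandra.input.fetch.size_in_rows", "500")
        .getOrCreate()
    )

    # Placeholder keyspace/table -- substitute your own, plus the SSL options
    # from the question if your cluster requires them.
    df = (
        spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="mytable", keyspace="mykeyspace")
        .load()
    )

    print(df.count())

Lowering the split size and fetch size makes each task read less data per round trip, which can help when count() fails on a few heavy partitions; check the linked reference page for the exact option names in your connector version.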