Is there a way to use the describe function in PySpark for more than one column?


I am trying to get summary statistics from a dataset in PySpark. When I combine the select function with the describe function to see the details of three columns, the result shows only the last column's information. I used a simple example from an article, with this command:

my_data.select('Isball', 'Isboundary', 'Runs').describe().show()

It should show the details of all three columns, but it shows only this:

+-------+------------------+
|summary|              Runs|
+-------+------------------+
|  count|               605|
|   mean|0.9917355371900827|
| stddev| 1.342725481259329|
|    min|                 0|
|    max|                 6|
+-------+------------------+

What should I do to get the results I am looking for?


There is 1 answer

walking (best answer)

The describe function works only on numeric and string columns as described in the documentation.

I'm assuming Isball and Isboundary are boolean columns, so describe leaves them out of its output. You can cast those columns to integer for it to work.

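from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse the active session (already available as `spark` in the pyspark shell and most notebooks).
spark = SparkSession.builder.getOrCreate()

# Small example with one integer, one boolean, and one string column.
df = spark.createDataFrame([
    (1, True, "lorem"),
    (2, False, "ipsum")
], ["integer_col", "bool_col", "string_col"])

# describe() skips bool_col because it is neither numeric nor string.
df.describe().show(truncate=False)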

+-------+------------------+----------+
|summary|integer_col       |string_col|
+-------+------------------+----------+
|count  |2                 |2         |
|mean   |1.5               |null      |
|stddev |0.7071067811865476|null      |
|min    |1                 |ipsum     |
|max    |2                 |lorem     |
+-------+------------------+----------+


# Cast the boolean column to integer so that describe() includes it.
df.withColumn("bool_col", col("bool_col").cast("integer")).describe().show(truncate=False)

+-------+------------------+------------------+----------+
|summary|integer_col       |bool_col          |string_col|
+-------+------------------+------------------+----------+
|count  |2                 |2                 |2         |
|mean   |1.5               |0.5               |null      |
|stddev |0.7071067811865476|0.7071067811865476|null      |
|min    |1                 |0                 |ipsum     |
|max    |2                 |1                 |lorem     |
+-------+------------------+------------------+----------+