Spark Rapids: Simple HashAggregate Example


Hi all, I am new to Spark Rapids. I was going through the basic introduction to Spark Rapids, where I found a figure (attached) explaining the difference between CPU-based and GPU-based query plans for a hash aggregate example. Everything in the plans is clear to me except the last phase, which converts the data back to the row format. Can anyone explain the reason behind this?

1 answer

Answered by Jason Lowe:

I do not see the referenced figure, but I suspect what is happening in your particular query comes down to one of two possible cases.

If your query is performing some kind of collection of the data back to the driver (e.g., `.show` or `.collect` in Scala, or otherwise directly displaying the query results), then the columnar GPU data needs to be converted back to rows before being returned to the driver. Ultimately the driver works with `RDD[InternalRow]`, which is why a transition from `RDD[ColumnarBatch]` needs to occur in those cases.
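As a hedged illustration of this first case (the input path and column name here are assumptions, not from the original question), a simple aggregation that collects results to the driver would end with a columnar-to-row transition in its physical plan:

```scala
// Sketch only: assumes a Spark session with the RAPIDS Accelerator enabled
// (e.g., spark.plugins=com.nvidia.spark.SQLPlugin) and a hypothetical input path.
val df = spark.read.parquet("/path/to/input")

// The aggregation itself can run on the GPU as a GpuHashAggregate,
// producing columnar batches (RDD[ColumnarBatch]).
val agg = df.groupBy("key").count()

// Inspect the physical plan; a columnar-to-row transition should appear
// near the end because the result is returned to the driver.
agg.explain()

// collect() brings rows back to the driver, which works with
// RDD[InternalRow], so the GPU's columnar data must be converted to rows.
val rows = agg.collect()
```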

If your query ends by writing the output to files (e.g., to Parquet or ORC), then the plan often shows a final GpuColumnarToRow transition anyway. Spark's Catalyst optimizer automatically inserts ColumnarToRow transitions when it sees operations that are capable of producing columnar output (i.e., `RDD[ColumnarBatch]`), and the plugin then updates those transitions to GpuColumnarToRow when the preceding node will operate on the GPU. However, in this case the final query node is a data write command, and those produce no output in the query-plan sense: the output is written directly to files when the node is executed, rather than being sent to a downstream node for further processing. The transition is therefore degenerate in practice, as the data write command sends no data to the columnar-to-row transition. I filed an issue against the RAPIDS Accelerator to clean up that degenerate transition, but it has no impact on query performance.
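A hedged sketch of this second case (again, the paths and column name are assumptions): a query that ends in a data write command, where the trailing columnar-to-row transition in the plan receives no data in practice.

```scala
// Sketch only: assumes a Spark session with the RAPIDS Accelerator enabled
// and hypothetical input/output paths.
spark.read.parquet("/path/to/input")
  .groupBy("key")
  .count()
  .write                 // final node is a data write command
  .mode("overwrite")
  .parquet("/path/to/output")  // output goes straight to files, not downstream
```

Because the write command executes its own output directly, the GpuColumnarToRow node that Catalyst inserts after it never sees any rows, which is why it is harmless to query performance.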