I have a file, fileA, in ORC format with the following schema:
key
 - id_1
 - id_2
value
 - value_1
 - ...
 - value_30
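Roughly, as a Spark StructType (the concrete field types here are just assumptions, only the field names are the real ones):

import org.apache.spark.sql.types._

val fileASchema = StructType(Seq(
  StructField("key", StructType(Seq(
    StructField("id_1", StringType),
    StructField("id_2", StringType)
  ))),
  StructField("value", StructType(
    (1 to 30).map(i => StructField(s"value_$i", DoubleType))
  ))
))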
If I use the following config:
'spark.sql.orc.filterPushdown' : true
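Roughly, the session is set up like this (the app name and input path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-pushdown-test")                    // placeholder app name
  .config("spark.sql.orc.filterPushdown", "true")  // enable ORC predicate pushdown
  .getOrCreate()

val fileA_DF = spark.read.orc("/path/to/fileA")    // placeholder path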
And my code looks like this:
val filter_A = fileA_DF
.filter(fileA_DF("value.value_1") > lit(some_value))
.select("key.id_")
the amount of data read is the same as with
val filter_A = fileA_DF
.filter(fileA_DF("value.value_1") > lit(some_value))
.select("*")
Shouldn't Spark only:
- predicate pushdown - read files and stripes that satisfy the filter
- projection pushdown - read columns that are used?
I also checked with a similarly sized Avro file and found no improvement in selection speed.
Am I measuring ORC the wrong way?
Let's take the following reproducible example:
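A minimal sketch of such an example, generating a small ORC file with the same nested key/value layout (the row count, the generated values, and the /tmp path are made up):

import org.apache.spark.sql.functions._

// 100 rows with a nested key struct (id_1, id_2) and a value struct (value_1 .. value_30)
val sample = spark.range(100).select(
  struct(col("id").cast("string").as("id_1"), lit("x").as("id_2")).as("key"),
  struct((1 to 30).map(i => (rand() * 100).as(s"value_$i")): _*).as("value")
)

sample.write.mode("overwrite").orc("/tmp/fileA_orc")

val fileA_DF = spark.read.orc("/tmp/fileA_orc")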
If we now read this file, apply the filters from your question, and use the explain method (see the sketch after this paragraph), we see that there are some DataFilters/PushedFilters at work, so predicate pushdown is working. If you really want to avoid reading full files, you need to make sure your input file is properly partitioned. Some more info about those filters here.
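The read-and-explain step looks roughly like this (the threshold and the choice of id_1 as the selected column are placeholders):

import org.apache.spark.sql.functions.lit

val some_value = 50  // placeholder threshold
val filter_A = fileA_DF
  .filter(fileA_DF("value.value_1") > lit(some_value))
  .select("key.id_1")

// explain(true) prints the parsed, analyzed, optimized and physical plans;
// the FileScan line of the physical plan is where DataFilters/PushedFilters show up
filter_A.explain(true)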
Now, we do indeed see that both the key and the value columns are being read in, but that is because a PushedFilter alone does not guarantee that you absolutely don't read in any rows where the filter predicate is false; it just applies a pre-filter at the file level (more info in this SO answer). So we also have to apply that filter in our Spark DAG (which you can see in the output of explain).

So, to wrap it up: both the key and value columns have to be read in. This is because your filter operation requires the value column, the final column you're interested in is the key column, and the PushedFilter alone does not guarantee your predicate to be true.
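As a side note on the partitioning point above, one rough way to let Spark skip whole directories is to partition the data on a coarse, top-level column derived from the nested field at write time (the bucket width and output path are made up):

import org.apache.spark.sql.functions.col

// partitionBy works on top-level columns, so derive one from the nested field first
fileA_DF
  .withColumn("value_1_bucket", (col("value.value_1") / 10).cast("int"))
  .write
  .partitionBy("value_1_bucket")
  .orc("/tmp/fileA_orc_partitioned")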