Spark ORC reader reads the complete file even though only a single column is queried


We are building a solution on Spark 1.6.1 in which we need to read an ORC file and execute business logic on it. The Spark documentation on reading ORC files says that the columnar format lets the reader read, decompress, and process only the columns that are required for the current query. In our case, however, even though the SQL query selects only one column, the Spark UI shows that the whole file is being read.
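For context, the single-column access pattern the documentation describes would look roughly like the sketch below (the path here is made up; our actual code is shown under "Sample code"). If pruning works as documented, only the address column should have to be read and decompressed:

# Hypothetical illustration of a column-pruned ORC read (path is not our real one)
df = sqlContext.read.format("orc").load("/path/to/orc/data")
df.select("address").count()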

I found a similar question about the Parquet format here -> While Reading a specific Parquet Column , All column are read instead of single column which is given in Parquet-Sql. It has not been answered either.

Sample code

The ORC file was created as follows:

from pyspark.sql import Row

# Parse the CSV into Rows; toDF() requires an active SQLContext/HiveContext
lines = sc.textFile("file:///home/user/data/testdata.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(id=int(p[0]), fname=p[1], mname=p[2], lname=p[3], address=p[4], marks=p[5], grade=p[6], age=p[7], experience=p[8]))
people.toDF().write.format("orc").save("/home/user/orcdata/people5m")
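As a sanity check (not part of the original job), the written data can be read back and its schema printed; this sketch assumes the same output path and a sqlContext created as in the read snippet below:

# Read the ORC data back and print the schema to confirm the column layout
df = sqlContext.read.format("orc").load("/home/user/orcdata/people5m")
df.printSchema()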

And it is read as follows:

from pyspark.sql import HiveContext, Row
sqlContext = HiveContext(sc)
people = sqlContext.read.format("orc").load("/home/user/orcdata/people5m")
people.registerTempTable("people")
addresses = sqlContext.sql("SELECT 'address' FROM people")
addresses.count()
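One way to see whether the scan is expected to prune columns is to inspect the physical plan via explain(); this is only a sketch of the check, and the plan output format varies across Spark versions:

# Inspect the physical plan of the SQL query above
addresses.explain(True)

# For comparison, the same projection expressed through the DataFrame API
people.select("address").explain(True)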

The size of the ORC data on HDFS is 91.6 M, and in the Spark UI the same number, 91.6 M, is shown in the Input column on the Stages tab. Is there anything wrong with this code? Can somebody explain this behavior?

[Spark UI screenshot]
