We are building a solution on Spark 1.6.1 in which we need to read an ORC file and run business logic on it. The Spark documentation on reading ORC files says that the columnar format lets the reader read, decompress, and process only the columns that are required for the current query. In our case, however, even though the SQL query selects only one column, the Spark UI shows that the whole file is read.
I found a similar question about the Parquet format here -> While Reading a specific Parquet Column , All column are read instead of single column which is given in Parquet-Sql, but it has not been answered.
Sample code
The ORC file was created as follows:
from pyspark.sql import Row

# each CSV line is split and mapped to a Row; the resulting DataFrame is saved as ORC
lines = sc.textFile("file:///home/user/data/testdata.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(id=int(p[0]), fname=p[1], mname=p[2], lname=p[3], address=p[4], marks=p[5], grade=p[6], age=p[7], experience=p[8]))
people.toDF().write.format("orc").save("/home/user/orcdata/people5m")
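To make sure the write itself is fine, I also check the schema of the generated ORC data (just a sanity-check sketch; it assumes the sqlContext from the read snippet below):

# sanity check: the ORC data should expose all nine columns
sqlContext.read.format("orc").load("/home/user/orcdata/people5m").printSchema()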
And it is read as follows:
from pyspark.sql import HiveContext, Row
sqlContext = HiveContext(sc)

# load the ORC data and register it as a temp table
people = sqlContext.read.format("orc").load("/home/user/orcdata/people5m")
people.registerTempTable("people")

# query meant to select just the address column
addresses = sqlContext.sql("SELECT 'address' FROM people")
addresses.count()
The size of the ORC data on HDFS is 91.6 M, and on the Spark UI the same figure, 91.6 M, appears in the Input column on the Stages tab. Is there anything wrong with this code? Can somebody explain this behavior?
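For reference, this is how I would check whether column pruning kicks in at all (a sketch using the same DataFrame and temp table as above; the select() variant is just an alternative I would expect to prune the same way):

# print the parsed/analyzed/optimized/physical plans for the SQL query
addresses.explain(True)

# same count through the DataFrame API, explicitly selecting only one column
people.select("address").count()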