I'm using Spark with Scala to read a specific Hive partition:
scala> spark.sql("select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z'").show
But I get this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 236 in stage 0.0 failed 4 times, most recent failure: Lost task 236.3 in stage 0.0 (TID 287, server, executor 17): org.apache.hadoop.security.AccessControlException: Permission denied: user=user, access=READ, inode="/path-to-table/table/year=2019/month=2/day=27/a=w/b=x/part-00002":user:group:-rw-rw----
As you can see, Spark is trying to read a different partition (day=27/a=w/b=x), one where I don't have read permissions.
It shouldn't, because the filter in my query selects exactly one partition.
I tried the same query in Hive and it works perfectly (no access problems):
Hive> select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z';
Why does Spark try to read this partition when Hive doesn't?
Is there a Spark configuration that I'm missing?
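One thing I was planning to try, assuming the problem is Spark's native Parquet reader rather than partition pruning itself: when `spark.sql.hive.convertMetastoreParquet` is true (the default), Spark reads Parquet Hive tables with its own reader instead of the Hive SerDe, which can behave differently from Hive when listing files. A sketch, not a confirmed fix:

```scala
// Sketch: fall back to the Hive SerDe reader instead of Spark's native
// Parquet path, and ensure partition pruning is pushed to the metastore.
// Both property names are standard Spark SQL configs.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

spark.sql("""
  select * from db.table
  where year = 2019 and month = 2 and day = 28 and a = 'y' and b = 'z'
""").show()
```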
Edit: More information
Some files were created with Hive; others were copied from another server and pasted into ours with different permissions (we cannot change those permissions), and then the table metadata should have been refreshed.
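Since partitions were copied in outside of Hive, the metastore may be out of sync with what is on disk; the kind of refresh I mean is along these lines (a sketch, using the same `db.table` name as above):

```scala
// Sketch: re-sync partition metadata after files were copied in manually.
// MSCK REPAIR TABLE discovers partition directories under the table
// location; refreshTable invalidates Spark's cached file listing.
spark.sql("MSCK REPAIR TABLE db.table")
spark.catalog.refreshTable("db.table")
```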
We are using this table definition (output of SHOW CREATE TABLE):
CREATE EXTERNAL TABLE Columns and type
PARTITIONED BY (
  `year` int COMMENT '*',
  `month` int COMMENT '*',
  `day` int COMMENT '*',
  `a` string COMMENT '*',
  `b` string COMMENT '*')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://path'
TBLPROPERTIES ('transient_lastDdlTime' = '1559029332')