Hive can't find partitioned data written by Spark Structured Streaming

615 views Asked by At

I have a spark structured streaming job, writing data to IBM Cloud Object Storage (S3):

dataDf.
  writeStream.
  format("parquet").
  trigger(Trigger.ProcessingTime(trigger_time_ms)).
  option("checkpointLocation", s"${s3Url}/checkpoint").
  option("path", s"${s3Url}/data").
  option("spark.sql.hive.convertMetastoreParquet", false).
  partitionBy("InvoiceYear", "InvoiceMonth", "InvoiceDay", "InvoiceHour").
  start()

I can see the data using the hdfs CLI:

[clsadmin@xxxxx ~]$ hdfs dfs -ls s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0 | head
Found 616 items
-rw-rw-rw-   1 clsadmin clsadmin      38085 2018-09-25 01:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      45874 2018-09-25 00:31 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-28ff873e-8a9c-4128-9188-c7b763c5b4ae.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin       5124 2018-09-25 01:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-5f768960-4b29-4bce-8f31-2ca9f0d42cb5.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      40154 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-70abc027-1f88-4259-a223-21c4153e2a85.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      41282 2018-09-25 00:50 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-873a1caa-3ecc-424a-8b7c-0b2dc1885de4.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      41241 2018-09-25 00:40 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-88b617bf-e35c-4f24-acec-274497b1fd31.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin       3114 2018-09-25 00:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-deae2a19-1719-4dfa-afb6-33b57f2d73bb.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      38877 2018-09-25 00:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-e07429a2-43dc-4e5b-8fe7-c55ec68783b3.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      39060 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00001-1553da20-14d0-4c06-ae87-45d22914edba.c000.snappy.parquet

However, when I try to query the data:

hive> select * from invoiceitems limit 5;
OK
Time taken: 2.392 seconds

My table DDL looks like this:

CREATE EXTERNAL TABLE `invoiceitems`(
  `invoiceno` int,
  `stockcode` int,
  `description` string,
  `quantity` int,
  `invoicedate` bigint,
  `unitprice` double,
  `customerid` int,
  `country` string,
  `lineno` int,
  `invoicetime` string,
  `storeid` int,
  `transactionid` string,
  `invoicedatestring` string)
PARTITIONED BY (
  `invoiceyear` int,
  `invoicemonth` int,
  `invoiceday` int,
  `invoicehour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data'

I've also tried with the correct case for column/partition names - this doesn't work either.

Any ideas why my query isn't finding the data?


UPDATE 1:

I have tried setting the location to a directory containing the data without partitions and this still doesn't work, so I'm wondering if it is a data formatting issue?

CREATE EXTERNAL TABLE `invoiceitems`(
  `InvoiceNo` int,
  `StockCode` int,
  `Description` string,
  `Quantity` int,
  `InvoiceDate` bigint,
  `UnitPrice` double,
  `CustomerID` int,
  `Country` string,
  `LineNo` int,
  `InvoiceTime` string,
  `StoreID` int,
  `TransactionID` string,
  `InvoiceDateString` string)
PARTITIONED BY (
  `InvoiceYear` int,
  `InvoiceMonth` int,
  `InvoiceDay` int,
  `InvoiceHour` int)
STORED AS PARQUET
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/';

hive> Select * from invoiceitems limit 5;
OK
Time taken: 2.066 seconds
1

There are 1 answers

3
Lakshman Battini On

Read from Snappy Compression parquet file

The data is in snappy compressed Parquet file format.

s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet

So set the ‘PARQUET.COMPRESS’=’SNAPPY’ table property in create table DDL statement. You can alternatively set parquet.compression=SNAPPY in the “Custom hive-site settings” section in Ambari for either IOP or HDP.

Here is an example of using the table property during a table creation statement in Hive:

hive> CREATE TABLE inv_hive_parquet( 
   trans_id int, product varchar(50), trans_dt date
    )
 PARTITIONED BY (
        year int)
 STORED AS PARQUET
 TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY');

Update Parition metadata in External table

Also, for an external Partitioned table, we need to update the partition metadata whenever any external job (spark job in this case) writes the partitions to Datafolder directly, because hive will not be aware of these partitions unless the explicitly updated.

that can be done by either:

ALTER TABLE inv_hive_parquet RECOVER PARTITIONS;
//or
MSCK REPAIR TABLE inv_hive_parquet;