I am using BigQuery to query an external data source (also known as a federated table), where the source data is a hive-partitioned Parquet table stored in Google Cloud Storage. I used this guide to define the table.
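For reference, a table of this shape can be created with the google-cloud-bigquery Python client roughly as follows (a sketch only; the project, dataset, and bucket names are placeholders, not my actual setup):

# Sketch: define an external BigQuery table over hive-partitioned
# Parquet in GCS; all resource names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

config = bigquery.ExternalConfig("PARQUET")
config.source_uris = ["gs://my-bucket/my-table/*"]

# Infer partition keys (e.g. year=2019/day=2019-01-01) from the paths.
hive = bigquery.external_config.HivePartitioningOptions()
hive.mode = "AUTO"
hive.source_uri_prefix = "gs://my-bucket/my-table"
config.hive_partitioning = hive

table = bigquery.Table("my-project.my_dataset.my_table")
table.external_data_configuration = config
client.create_table(table)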
My first query to test this table looks like the following:
SELECT * FROM my_dataset.my_table WHERE year=2019 AND day = "2019-01-01" LIMIT 10
This query fails with the following error:
Column visitor_partition has type INT64, which differs from the expected type of INT32
I want to know why this error occurs and how to work around or resolve it. I have done some investigation with the parquet-tools library to peek inside the internals of my Parquet data.
When I run
java -jar ./parquet-tools-1.10.0.jar meta test.c000.gz.parquet | grep visitor_partition
over one of my Parquet files, the following relevant lines are returned:
visitor_partition: OPTIONAL INT64 R:0 D:1
visitor_partition: INT64 GZIP DO:0 FPO:59420041 SZ:54561/537912/9.86 VC:633590 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 99, num_nulls: 0]
When I run the schema command, the following relevant schema info comes up:
optional int64 visitor_partition;
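As a cross-check independent of parquet-tools, the same schema can be read with pyarrow (a sketch, assuming pyarrow is installed and the file is available locally):

import pyarrow.parquet as pq

# Print the declared type of the column; an INT32 column with a DATE
# logical type would show up here as date32 rather than int64.
schema = pq.read_schema("test.c000.gz.parquet")
print(schema.field("visitor_partition"))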
So it's clear that in this Parquet file, the data for the visitor_partition
field is represented as an INT64. Why, then, does BigQuery expect to see an INT32 type here?
Note that the schema of the external table is inferred from the last file when all files matching the table's source URI are sorted lexicographically by name. Is there any chance that this particular Parquet file has a different schema than the one you described, e.g., an INT32 column with a DATE logical type for the visitor_partition field? BigQuery would infer that as a DATE type.
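You can check which file that is and what it contains; the following sketch assumes the google-cloud-storage and pyarrow packages, with placeholder bucket and prefix names:

import pyarrow.parquet as pq
from google.cloud import storage

# List every object matching the table's source URI, sort the names
# lexicographically, and inspect the last Parquet file: that is the
# file BigQuery uses to infer the external table's schema.
client = storage.Client()
blobs = sorted(client.list_blobs("my-bucket", prefix="my-table/"),
               key=lambda b: b.name)
last = [b for b in blobs if b.name.endswith(".parquet")][-1]
last.download_to_filename("/tmp/last.parquet")
print(pq.read_schema("/tmp/last.parquet").field("visitor_partition"))

If the field prints as a date type rather than int64 for that file, it would explain the mismatch you are seeing.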