I cannot find any reference for dealing with file metadata when creating an External Table starting from a partitioned source of files. More precisely: I have a set of partitioned parquet files. The partition strategy is in the form:
{YEAR}/{MONTH}/{filename}.parquet
Now I can create an external table referencing the whole set using the LOCATION
pointing at the root of the partition and using s recursive strategy.
LOCATION = 'folder_or_filepath' Specifies the folder or the file path and file name for the actual data in Hadoop or Azure blob storage. The location starts from the root folder. The root folder is the data location specified in the external data source.
In this context, it would be crucial to be able to access partitioning metadata like {YEAR}
, {MONTH}
or {filename}
and store them as columns into the newly created external table for further usages.
By my researches, access file metadata seems to be a missing feature right now. But I'm not sure. For sure, it is not possible leverages on PARTITION BY functionality as evidenced here:
Is there some mitigation strategy? I'm about to set up a Data Factory Mapping Dataflow which will do the dirty job. But I'm still unsure about these two options:
- Reducing the partitioned set into a single file adding metadata columns on each row;
- Just adding metadata columns on each file and leave the partitioned hierarchy;
- Bonus: any suggestion?