Azure Synapse Polybase/External tables - return only latest file

1k views Asked by At

We have an files partitioned in the datalake and are using Azure Synapse SQL Serverless pool to query them using external tables before visualising in Power BI.

Files are stored in the following partition format {source}/{year}/{month}/{filename}_{date}.parquet

We then have an external table that loads all files for that source.

For all files that increment each day this is working great as we want all files to be included. However we have some integrations that we want to return only the latest file. (i.e. the latest file sent to us is the current state that we want to load into Power BI).

Is it possible in the external table statement to only return the latest file? Or do we have to add extra logic? We could load all the files in, and then filter for the latest filename and save that in a new location. Alternatively we could try to create an external table that changes every day.

Is there a better way to approach this?

2

There are 2 answers

0
Justin Jahn On BEST ANSWER

You can use file metadata to query for the latest file but you will need to ensure your parquet files begin with date stamps:

CREATE VIEW v_latest_user AS (
   SELECT
       *
       , t1.filename() AS [FILE_NAME]
   FROM OPENROWSET (
       BULK 'RAW/my_data/users/year=*/month=*/*.parquet',
       DATA_SOURCE = 'myDataLake',
       FORMAT = 'PARQUET'
   ) WITH (
        Serial bigint
        -- , MyColumn varchar(8), ...
   ) t1
    WHERE
        t1.filename() IN (
            SELECT TOP 1
                t2.filename()
            FROM OPENROWSET  (
                BULK 'RAW/my_data/users/year=*/month=*/*.parquet',
                DATA_SOURCE = 'myDataLake',
                FORMAT = 'PARQUET'
            ) t2
            ORDER BY
                t2.filename() DESC
        )
)

Take a look at https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files for more information on using file metadata in SQL Serverless queries.

1
JPVoogt On

If you are using dedicated pools then I would alter the location of your table with the latest files folder.

Load every day into a new folder and then alter the LOCATION of the external table to look at the current/latest day, but you might need to add additional logic to track in a control table what the latest successful load date is.

Unfortunately I have not found a better way to do this myself.