Power BI incremental refresh from Azure Blob

If I have a list of many blobs in a container, can I set my RangeStart and RangeEnd parameters based on the modified timestamp of the CSV files? My blobs are partitioned by created date, but rows can be updated historically. I need to make sure that Power BI has the latest version of each row (based on the updated_at timestamp).

My plan:

1- Filter the blobs I want based on the blob prefix (virtual directory).
2- Filter the blobs on the Date modified attribute, parameterized by RangeStart and RangeEnd (this greatly limits the number of blobs that need to be scanned); a sketch of steps 1 and 2 follows this list.
3- Sort the data and drop duplicates as a final step (a sketch of this appears after the sample query below).
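
Steps 1 and 2 can be applied directly to the blob metadata table that AzureStorage.Blobs returns, before any file content is read. Here is a minimal sketch; the account, container, and prefix names are placeholders, and it assumes RangeStart and RangeEnd are the usual datetime parameters:

let
    // list containers in the storage account (hypothetical account name)
    source = AzureStorage.Blobs("mystorageaccount"),
    // drill into the container to get the blob list (hypothetical container name)
    blobs = source{[Name = "mycontainer"]}[Data],
    // step 1: keep only blobs under the virtual directory prefix (hypothetical prefix)
    by_prefix = Table.SelectRows(blobs, each Text.StartsWith([Name], "events/")),
    // step 2: filter on the blob-level Date modified attribute using the refresh window
    by_modified = Table.SelectRows(by_prefix, each [Date modified] >= RangeStart and [Date modified] < RangeEnd)
in
    by_modified

Since [Date modified] is blob metadata, this filter should be evaluated against the blob listing rather than the file contents.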

Would this pattern work, and does it seem efficient? My concern with using the updated_at timestamp as the incremental column is that files created weeks or months ago might get updated (it depends purely on customer activity). It seems like Power BI would need to scan a lot of blobs just to know which rows might have been updated.

I tested this out and it works in Power BI Desktop, but I am not seeing the parameters show up in the Power BI service, which has me worried (the refresh has been running for ~4 hours so far).

Edit: added a sample query below. The key step for efficiency is the filter_dates step. I convert the RangeStart and RangeEnd parameters to date_id values (YYYYMMDD) to match our partitioning, which lets me skip any partition that the incremental refresh does not require.
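
The convert_date_id helper is referenced in the query but not shown; a minimal sketch of what it might look like, assuming it maps a window-boundary datetime to a YYYYMMDD integer (it could live as a step in the query or as a separate shared query):

    // hypothetical implementation of the convert_date_id helper
    convert_date_id = (dt as datetime) as number =>
        // build the YYYYMMDD integer from the datetime's date parts
        Date.Year(dt) * 10000 + Date.Month(dt) * 100 + Date.Day(dt)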

And finally, the autogenerated incremental refresh step filters the actual rows on the timestamp column. Thanks to the filter_dates step, those rows are guaranteed to come only from partitions that contain the requested data. (FYI: I removed the dtypes and column names.)

let
    // connect to the storage account and list its containers
    list_containers = AzureStorage.Blobs(account),
    // drill into the target container to get the blob list
    list_folders = list_containers{[Name = container]}[Data],
    // step 1: keep only blobs under the virtual directory prefix
    filter_blobs = Table.SelectRows(list_folders, each Text.StartsWith([Name], folder)),
    // split ".../date_id=YYYYMMDD/file" names on the last "=" to expose the partition value
    split_name = Table.SplitColumn(filter_blobs, "Name", Splitter.SplitTextByEachDelimiter({"="}, QuoteStyle.Csv, true), {"Name", "date_id"}),
    split_delimiter = Table.SplitColumn(split_name, "date_id", Splitter.SplitTextByDelimiter("/"), {"date_id", "filename"}),
    remove_columns = Table.RemoveColumns(split_delimiter, {"Name", "filename", "Date accessed", "Date created", "Date modified", "Attributes", "Folder Path"}),
    // step 2: prune partitions outside the refresh window (convert_date_id is a helper defined elsewhere)
    filter_dates = Table.SelectRows(remove_columns, each Number.FromText([date_id]) >= convert_date_id(RangeStart) and Number.FromText([date_id]) < convert_date_id(RangeEnd)),
    filter_files = Table.SelectRows(filter_dates, each [Extension] = ".parquet"),
    // read each remaining parquet blob into a nested table
    prepare_data = Table.AddColumn(filter_files, "data", each Parquet.Document([Content])),
    remove_content = Table.RemoveColumns(prepare_data, {"Content", "Extension"}),
    // column name and dtype lists removed by the author
    expand_data = Table.ExpandTableColumn(remove_content, "data"),
    fix_dtypes = Table.TransformColumnTypes(expand_data),
    // autogenerated incremental refresh filter on the actual row timestamps
    #"records-7A6569745F757463-autogenerated_for_incremental_refresh" = Table.SelectRows(fix_dtypes, each DateTime.From([timestamp_utc]) >= RangeStart and DateTime.From([timestamp_utc]) < RangeEnd)
in
    #"records-7A6569745F757463-autogenerated_for_incremental_refresh"
There are 0 answers