Autoloader: pass the folder names in the data source path with file notification

I'm using Auto Loader to load data incrementally, and I have enabled file notification mode.

My question is whether the logic I use to build the data source path conflicts with how file notification works.

Every day we receive data in the lake as a new folder (the folder name is the UTC time at which it was loaded). I want Auto Loader to read all the folders except the most recently loaded one.

To achieve that, I wrote this code:

data_source = f"abfss://{container_read}@{storage_read}.dfs.core.windows.net/{folders}/{cdm}/*.csv"

The documentation says I can pass several alternative values using the syntax {ab,cd},

so my folders variable looks like this: {2023-10-09T12.31.31Z,2023-10-09T14.02.15Z}
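
For illustration, a value like the one above could be built as follows. This is only a minimal sketch: it assumes the folder names have already been listed somehow (for example with dbutils.fs.ls) and that the fixed-width UTC names sort chronologically as plain strings. The helper name build_folder_glob is hypothetical.

```python
def build_folder_glob(folder_names):
    """Return a '{a,b,...}' glob covering every folder except the newest one."""
    # Fixed-width UTC names such as 2023-10-09T12.31.31Z sort chronologically
    # as plain strings, so the last element after sorting is the latest load.
    ordered = sorted(folder_names)
    older = ordered[:-1]  # drop the latest folder, which is still being written
    return "{" + ",".join(older) + "}"


folders = build_folder_glob([
    "2023-10-09T14.02.15Z",
    "2023-10-09T12.31.31Z",
    "2023-10-09T16.45.00Z",
])
print(folders)
```

The result is what gets substituted for {folders} in data_source.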

I realized that I will always be passing the full list of folders to read except the latest one, which seems to conflict with the idea of file notification, since many of the folders passed through the folders variable will already have been read in the past.

So could someone explain how I can tell Auto Loader to always exclude the latest loaded folder? The reason I want to exclude it is that the CSV files in there are updated every second, so Auto Loader fails when reading them.

EDIT:

I just found in the documentation that changing the source path is not supported in Auto Loader when file notification is enabled. So I'm not sure now whether my solution above even works properly without losing data.

Answer by JayashankarGS:

It is not possible to apply your logic in Auto Loader without knowing either the folder name to exclude or the time after which you don't want files to be loaded. This is because, once you provide the path in folders, Auto Loader receives notifications for all updates under it and checks the files to process. If a file is missing, you can handle that with ignoreMissingFiles, but in your case the file is frequently updated, which raises an error.

So, if you know the folder, let's say you don't want to read the folder 2023-10-09T14.02.15Z, then you can remove that folder from folders. Alternatively, if you know the timestamp after which you don't need the data, let's say you don't want to read data loaded after 13:00:00, then you can filter the folders as shown below.

import re

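folders = [
    "2023-10-09T12.31.31Z",
    "2023-10-09T13.15.00Z",
    "2023-10-09T14.02.15Z",
]

threshold_time = "2023-10-09T13.00.00Z"
# The pattern hard-codes the cut-off: it matches hours 00-12, i.e. before threshold_time.
# Dots are escaped so they only match a literal "." in the folder name.
regex_pattern = re.compile(r"2023-10-09T(0[0-9]|1[0-2])\.[0-9][0-9]\.[0-9][0-9]Z")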
filtered_folders = [folder for folder in folders if regex_pattern.search(folder)]
print(",".join(filtered_folders))

Output: 2023-10-09T12.31.31Z
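
The regex above bakes the cut-off hour into the pattern, so it has to be rewritten for every new threshold. A possible variant, sketched here under the assumption that all folder names follow the layout shown in the question, parses the names and compares them with threshold_time directly. The names FOLDER_FORMAT and folders_before are hypothetical.

```python
from datetime import datetime

# Layout of the folder names in the question, e.g. 2023-10-09T12.31.31Z
FOLDER_FORMAT = "%Y-%m-%dT%H.%M.%SZ"


def folders_before(folder_names, threshold):
    """Keep only the folders whose timestamp is strictly before the threshold."""
    cutoff = datetime.strptime(threshold, FOLDER_FORMAT)
    return [
        name for name in folder_names
        if datetime.strptime(name, FOLDER_FORMAT) < cutoff
    ]


filtered = folders_before(
    ["2023-10-09T12.31.31Z", "2023-10-09T13.15.00Z", "2023-10-09T14.02.15Z"],
    "2023-10-09T13.00.00Z",
)
print(",".join(filtered))
```

With the same three folders and the 13:00:00 threshold, this keeps only 2023-10-09T12.31.31Z, the same result as the regex version.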

Alternatively, pass the extra option modifiedBefore with a timestamp.

Below is the configuration I used.

autoloader_config = {
    "cloudFiles.format": "csv",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.resourceGroup": resourcegroup,
    "cloudFiles.clientId": client_id,
    "cloudFiles.clientSecret": client_secret,
    "cloudFiles.tenantId": tenant_id,
    "cloudFiles.connectionString": conn_string,
    "cloudFiles.subscriptionId": subscription_id,
    "cloudFiles.schemaLocation": schema_location,
    "header": True,
    "modifiedBefore": "2023-11-14 12:20:00.000000 UTC+5:30"
}

Here, I have set modifiedBefore to 12:20:00, so only files last modified before that time are loaded.

And the initial output:

date        state  store_id  product_category  SKU  price  _rescued_data  file_path
2016-07-28  CA     1         A                 A1   7.92   null           wasbs://[email protected]/autoloader/hts-sample-test.csv
2016-07-29  CA     1         A                 A1   7.88   null           wasbs://[email protected]/autoloader/hts-sample-test.csv
2016-07-30  CA     1         A                 A1   8.0    null           wasbs://[email protected]/autoloader/hts-sample-test.csv
2016-07-31  CA     1         A                 A1   7.94   null           wasbs://[email protected]/autoloader/hts-sample-test.csv

Next, I add a new file with a last-modified time later than 12:20:00, or update the same file after 12:20:00.

Now my stream returns no results, because it does not load anything modified after 12:20:00.

So, as long as you know the timestamp or the folder you don't want to process, you can exclude it from Auto Loader.
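
If the cut-off is not a fixed time but "everything older than some safety margin", the modifiedBefore value could be computed each time the stream is started. The sketch below is only an illustration: the helper name modified_before_cutoff and the 30-minute margin are assumptions, and the "UTC+0" suffix simply mirrors the style of the value in the configuration above, so it is worth checking that your runtime accepts it.

```python
from datetime import datetime, timedelta, timezone


def modified_before_cutoff(margin_minutes, now=None):
    """Format 'now minus margin' in the timestamp style used in the config above."""
    current = now or datetime.now(timezone.utc)
    # Normalise to UTC so the "UTC+0" suffix is always accurate.
    cutoff = current.astimezone(timezone.utc) - timedelta(minutes=margin_minutes)
    return cutoff.strftime("%Y-%m-%d %H:%M:%S.%f") + " UTC+0"


# 07:20 UTC minus 30 minutes is 06:50 UTC, the same instant as 12:20 UTC+5:30.
example_now = datetime(2023, 11, 14, 7, 20, tzinfo=timezone.utc)
print(modified_before_cutoff(30, now=example_now))
```

The returned string would then be assigned to the "modifiedBefore" key of autoloader_config before the stream is created.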