I'm using Autoloader to load data incrementally, and I have enabled the file notification service.
My question is whether the logic I want to apply when passing the data source path conflicts with file notification mode.
Every day we receive data in the lake as a new folder (named with the UTC time at which it was loaded). I want Autoloader to read all the folders except the most recently loaded one.
To achieve that, I wrote this code:
data_source = f"abfss://{container_read}@{storage_read}.dfs.core.windows.net/{folders}/{cdm}/*.csv"
The documentation says that I can pass several different values using brace syntax, like {ab,cd}, so my folders variable looks like this: {2023-10-09T12.31.31Z,2023-10-09T14.02.15Z}
I realized that I will always be passing the full list of folders to be read except the latest one, which seems to conflict with the idea of file notifications, since many of the folders passed through the folders variable will already have been read in the past.
So could someone explain how I can tell Autoloader to always exclude the latest loaded folder? The reason I want to exclude the latest folder is that the CSV files in it are updated every second, so Autoloader fails when reading them.
EDIT:
I just found in the documentation that changing the path is not supported in Autoloader when file notification is enabled. So I'm now not sure whether my solution above even works properly without losing any data.
It is not possible to use your logic in Autoloader without knowing either the folder name to exclude or the time after which you don't want files to be loaded. This is because when you provide the path in folders, Autoloader gets notifications for all updates and checks the files to process. If a file is not present, you can handle it using ignoreMissingFiles, but in your case the file is frequently updated, which raises an error.
So, if you know the folder you don't want to read, let's say 2023-10-09T14.02.15Z, then you can remove that folder from folders. Alternatively, if you know the timestamp after which you don't need the data, let's say you don't want to read the data after 13:00:00, then you can filter folders like below.
Output:
2023-10-09T12.31.31Z
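The filtering step can be sketched roughly as follows. This is only an illustration: the helper names folders_before and to_glob are hypothetical, the folder name format is inferred from names like 2023-10-09T12.31.31Z, and the cutoff is assumed to be a full UTC date and time rather than a time of day alone.

```python
from datetime import datetime

# Assumed from folder names such as 2023-10-09T12.31.31Z
FOLDER_FORMAT = "%Y-%m-%dT%H.%M.%SZ"


def folders_before(folder_names, cutoff):
    """Keep only folders whose name encodes a time strictly before `cutoff`."""
    kept = []
    for name in folder_names:
        loaded_at = datetime.strptime(name, FOLDER_FORMAT)
        if loaded_at < cutoff:
            kept.append(name)
    return kept


def to_glob(folder_names):
    """Join folder names into the {a,b} form used in the path."""
    if not folder_names:
        raise ValueError("no folders left to read")
    if len(folder_names) == 1:
        # A single folder needs no braces
        return folder_names[0]
    return "{" + ",".join(folder_names) + "}"


all_folders = ["2023-10-09T12.31.31Z", "2023-10-09T14.02.15Z"]
selected = folders_before(all_folders, datetime(2023, 10, 9, 13, 0, 0))
folders = to_glob(selected)
print(folders)
```

The resulting folders value can then be substituted into the data_source f-string from the question.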
Or provide the extra option modifiedBefore with a timestamp. Below is the configuration I have given.
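A configuration along these lines might look like the sketch below. It is not a definitive setup: the helper names autoloader_options and read_stream, the schema location path, and the date part of the timestamp are assumptions made for illustration. The timestamp uses the YYYY-MM-DDTHH:mm:ss form that Spark documents for modifiedBefore, and it is worth checking which time zone it is interpreted in on your cluster.

```python
def autoloader_options(modified_before, schema_location):
    """Build the reader options for a CSV Autoloader stream in notification mode."""
    return {
        "cloudFiles.format": "csv",
        "cloudFiles.useNotifications": "true",
        # Hypothetical location; Autoloader stores the inferred schema here
        "cloudFiles.schemaLocation": schema_location,
        # Files modified at or after this timestamp are skipped
        "modifiedBefore": modified_before,
    }


def read_stream(spark, source_path, options):
    """Create the streaming DataFrame; `spark` is an existing SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )


options = autoloader_options("2023-10-09T12:20:00", "/tmp/autoloader/_schema")
# On a Databricks cluster: df = read_stream(spark, data_source, options)
```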
Here, I have set modifiedBefore to 12:20:00.
And the initial output:
Now, I am adding a new file with a last modified time later than 12:20:00, or updating the same file after 12:20:00.
Before:
After update:
Now my stream gives no results, as it doesn't load anything modified after 12:20:00.
So, as long as you know the timestamp or the folder you don't want to process, you can exclude it from Autoloader.
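If the folder to skip is simply the most recent one, it could be derived from the folder names themselves. The following is a minimal sketch under that assumption. The helper names all_but_latest and build_source_path and the sample container, storage, and cdm values are hypothetical, and the folder list is assumed to have been obtained already (for example by listing the parent directory). Keep in mind the caveat from the question that changing the source path between runs may not be supported in file notification mode.

```python
from datetime import datetime

# Assumed from folder names such as 2023-10-09T12.31.31Z
FOLDER_FORMAT = "%Y-%m-%dT%H.%M.%SZ"


def all_but_latest(folder_names):
    """Return the folders sorted by load time, without the most recent one."""
    ordered = sorted(
        folder_names,
        key=lambda name: datetime.strptime(name, FOLDER_FORMAT),
    )
    return ordered[:-1]


def build_source_path(container, storage, folder_names, cdm):
    """Build the abfss path with the folders joined in {a,b} form."""
    folders = "{" + ",".join(folder_names) + "}"
    return (
        f"abfss://{container}@{storage}.dfs.core.windows.net/"
        f"{folders}/{cdm}/*.csv"
    )


# Hypothetical sample values
found = ["2023-10-09T14.02.15Z", "2023-10-08T09.00.00Z", "2023-10-09T12.31.31Z"]
stable = all_but_latest(found)
data_source = build_source_path("mycontainer", "mystorage", stable, "mycdm")
print(data_source)
```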