I am trying to load image data for model training from self-hosted S3 storage (MinIO). PyTorch provides new datapipes with this functionality in the `torchdata` library.
So within my function to create the datapipe, I have these lines:
```python
from torchdata.datapipes.iter import IterableWrapper

dp_s3 = IterableWrapper(list(sample_dict.keys()))  # S3 URLs taken from the label dict
dp_s3 = dp_s3.load_files_by_s3()                   # yields (url, io.BytesIO) tuples
dp_s3 = dp_s3.map(open_image)
dp_s3 = dp_s3.map(transform)
```
The problem with this approach is that the S3 file loader datapipe returns a tuple of a string (the file path on the S3 storage, acting as a label) and an `io.BytesIO` containing the image data. However, I have all the labels and the files to load in separate text files, which are loaded into `sample_dict` (a dictionary mapping file paths to classification labels) in a previous step.
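For concreteness, a minimal sketch of the two shapes involved (bucket name, keys, and labels are made up for illustration):

```python
# sample_dict, built from the separate text files in a previous step:
sample_dict = {
    "s3://my-bucket/images/0001.jpg": "cat",
    "s3://my-bucket/images/0002.jpg": "dog",
}

# what load_files_by_s3() yields per element (the URL, but no classification label):
# ("s3://my-bucket/images/0001.jpg", io.BytesIO(b"...image bytes..."))
```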
Question is now, how can I get the labels from sample_dict into my mapping functions?
There seem to be two main obstacles to achieving this:
- The dataloader is multi-threaded, and I get a pickle error if I add `sample_dict` to it. I also cannot make the dictionary globally accessible to the other worker threads, which are handled by PyTorch (see the sketch after this list).
- `load_files_by_s3()` is the functional name for `S3FileLoader`, which can only deal with S3-type file paths as input. My initial thought was that I need to use a map-style datapipe for this instead of an iterable-style one, but unfortunately there are no map-style S3 datapipes available.
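To illustrate the pickle error from the first obstacle, a minimal sketch (the names are made up, and whether it actually fails depends on the multiprocessing start method):

```python
import pickle

sample_dict = {"s3://my-bucket/images/0001.jpg": "cat"}

# A lambda (or any closure) capturing sample_dict cannot be pickled,
# which is what the DataLoader does to ship the datapipe to its workers:
attach_label = lambda item: (sample_dict[item[0]], item[1])

try:
    pickle.dumps(attach_label)
except (pickle.PicklingError, AttributeError) as e:
    print(f"pickling failed: {e}")
```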
I think I found the answer: just use plain and simple `functools.partial` and map my function with `sample_dict` as a fixed input.
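A minimal sketch of what this could look like (the function name `add_label` and the tuple unpacking are my assumptions, not the original code):

```python
from functools import partial

def add_label(sample_dict, item):
    # item is the (url, io.BytesIO) tuple yielded by load_files_by_s3()
    url, stream = item
    return sample_dict[url], stream

# partial binds sample_dict up front; the resulting object pickles fine
# because add_label is a top-level function and the dict is picklable
dp_s3 = dp_s3.map(partial(add_label, sample_dict))
```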
I still have to test this before marking the question as answered, but initial debugging looks promising.
Additionally, there is already an open feature request on the torchdata repo which seems to address this problem as well.