Which kedro dataset should be used when working with images and keras ImageDataGenerator? I know there is ImageDataset but the number of images is too large to fit in memory. And all that keras ImageDataGenerator really needs is a local folder location to the image dataset in the form of:
data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
It would be possible to use a parameter specifying the data location, but I think the appropriate place for a data location is the Data Catalog. Is there a simple way to specify this data location in the Data Catalog?
There are two parts of your question which I think are important to separate:
1: Is it possible to configure a custom ImageDataGenerator dataset?
Here's a little bit of incomplete Python code that you could use to build out a custom dataset. I'll leave it to you to get it into working shape if you want a solution like this. Look at the sample datasets in the Kedro GitHub repo for inspiration, and at the tutorial on creating custom datasets in the Kedro documentation on Read the Docs.
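A minimal sketch of such a dataset might look like the following. It assumes Kedro's AbstractDataset base class (AbstractDataSet in older releases) with its _load/_save/_describe hooks, and Keras' ImageDataGenerator.flow_from_directory. The class name ImageGeneratorFolderDataset and the generator_args/flow_args parameters are made up for illustration. Treat it as a starting point, not a tested implementation.

```python
from typing import Any, Dict, Optional

# Kedro renamed AbstractDataSet to AbstractDataset in later releases, so try both.
# The final fallback to `object` only exists so this sketch can be imported
# without Kedro installed; a real project would import the base class directly.
try:
    from kedro.io import AbstractDataset
except ImportError:
    try:
        from kedro.io import AbstractDataSet as AbstractDataset
    except ImportError:
        AbstractDataset = object


class ImageGeneratorFolderDataset(AbstractDataset):  # hypothetical name
    """Read-only dataset that hands back a Keras directory iterator
    instead of loading every image into memory."""

    def __init__(
        self,
        filepath: str,
        generator_args: Optional[Dict[str, Any]] = None,
        flow_args: Optional[Dict[str, Any]] = None,
    ):
        self._filepath = filepath
        # Keyword arguments for ImageDataGenerator(...), e.g. {"rescale": 1 / 255}
        self._generator_args = generator_args or {}
        # Keyword arguments for flow_from_directory(...), e.g. {"batch_size": 32}
        self._flow_args = flow_args or {}

    def _load(self):
        # Imported lazily so TensorFlow is only required when data is loaded.
        from tensorflow.keras.preprocessing.image import ImageDataGenerator

        generator = ImageDataGenerator(**self._generator_args)
        # Returns an iterator that reads batches of images from disk on demand.
        return generator.flow_from_directory(self._filepath, **self._flow_args)

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> Dict[str, Any]:
        return {
            "filepath": self._filepath,
            "generator_args": self._generator_args,
            "flow_args": self._flow_args,
        }
```

Once the class lives somewhere importable in your project, a catalog entry would set type to its dotted import path and filepath to the folder (for example data/train), which keeps the data location in the Data Catalog and out of the parameters.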
2: Is it possible to configure the above with file path parameters that match my use case?
While we could modify the above to take in some parameters and return different iterators, this might give us issues if the directory structure is different. This is because parameterisation largely relies on common conventions.
If your convention is
data/{train/validation}/{dog/cat}/images...
your solution to extract and apply parameters is likely to be coupled to the order of train/validation and dog/cat, and would likely not work for a different user whose convention is
data/{dog/cat}/{train/validation}/images...
A better pattern would perhaps be to implement a solution like the one outlined in the first section, register a dataset in the catalog for each of your training and validation sets, and combine the iterators at runtime within your nodes to create train and validation iterators.
For example, you would have the datasets train_cats, train_dogs, validation_cats and validation_dogs. Within the node you could izip these iterators together (see https://stackoverflow.com/a/243902/13341083); in Python 3 the built-in zip does the job of itertools.izip.
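As a rough illustration of combining the iterators inside a node, here is a small Python 3 sketch. The interleave helper is a hypothetical name, and the lists of strings stand in for the batches that the real dataset iterators would yield.

```python
from itertools import chain


def interleave(*iterators):
    """Yield one item from each iterator in turn.

    Stops as soon as the shortest iterator is exhausted, because zip does.
    Evaluation is lazy, so it also works with iterators that never end.
    """
    return chain.from_iterable(zip(*iterators))


# Stand-ins for the iterators loaded from the catalog inside a node.
train_cats = iter(["cat_batch_0", "cat_batch_1"])
train_dogs = iter(["dog_batch_0", "dog_batch_1"])

# Alternates cat and dog batches: cat_batch_0, dog_batch_0, cat_batch_1, ...
train_batches = interleave(train_cats, train_dogs)
```

The validation iterators can be combined in the same way. If each iterator only ever sees one class, the node would also need to attach the right label to each batch before training.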
If you end up pursuing this approach, please raise a PR and contribute :) Best of luck