kedro: train image classifier with keras ImageDataGenerator


Which kedro dataset should be used when working with images and keras ImageDataGenerator? I know there is ImageDataSet, but the number of images is too large to fit in memory. And all that keras ImageDataGenerator really needs is a local folder path to the image dataset in the form of:

data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...

It would be possible to use a parameter specifying the data location but I think the appropriate location for data should be the Data Catalog. Is there a simple way to specify this data location in the Data Catalog?

There are 2 answers

Answer from William Ashford

There are two parts to your question which I think are important to separate:

  1. Is it possible to configure a custom ImageDataGenerator dataset? (TL;DR: yes)
  2. Is it possible to configure the above with file path parameters that match my use case? (TL;DR: yes, but you probably don't want to bake your particular directory structure into the dataset, since other users with a different layout wouldn't be able to reuse it.)

1: Is it possible to configure a custom ImageDataGenerator dataset?

Here's a little bit of incomplete Python code that you could use to build out a custom dataset. I'll leave it to you to get it into working shape if you want a solution like this. Look at the existing datasets in the Kedro GitHub repo for inspiration, and at the tutorial on creating custom datasets in the Kedro documentation.

from typing import Any, Dict

import tensorflow as tf
from kedro.io.core import AbstractDataSet

class ImageDataGeneratorDataSet(AbstractDataSet):

    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None):

        self.filepath = filepath
        self.load_args = load_args or {}
        self.save_args = save_args or {}

    def _load(self):
        # load_args are passed straight through to ImageDataGenerator,
        # which then reads batches lazily from the configured directory.
        generator = tf.keras.preprocessing.image.ImageDataGenerator(**self.load_args)
        return generator.flow_from_directory(self.filepath)

    def _save(self, data):
        raise NotImplementedError("Saving with the ImageDataGeneratorDataSet is not supported")

    def _describe(self):
        return dict(filepath=self.filepath, load_args=self.load_args)
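
Once that's in working shape, registering it in the Data Catalog is just a catalog.yml entry per directory. A sketch, assuming the class lives at src/my_project/extras/datasets/image_dataset.py (adjust the type path to wherever you actually put it):

```yaml
# conf/base/catalog.yml
train_dogs:
  type: my_project.extras.datasets.image_dataset.ImageDataGeneratorDataSet
  filepath: data/train/dogs/
  load_args:
    horizontal_flip: true  # forwarded to ImageDataGenerator

train_cats:
  type: my_project.extras.datasets.image_dataset.ImageDataGeneratorDataSet
  filepath: data/train/cats/
```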

2: Is it possible to configure the above with file path parameters that match my use case?

While we could modify the above to take in some parameters and return different iterators, this might give us issues if the directory structure is different. This is because parameterisation largely relies on common conventions.

If your convention is data/{train/validation}/{dog/cat}/images... your solution to extract and apply parameters is likely to be coupled to the respective order of train/validation and dog/cat and would likely not work for a different user who might have a convention of data/{dog/cat}/{train/validation}/images....

A better pattern would perhaps be to implement a solution like I've outlined in the first section, register a dataset in the catalog for each of your training/validation data directories, and combine the iterators at runtime within your nodes to create the train and validation iterators.

For example you would have the datasets train_cats, train_dogs, validation_cats, validation_dogs. Within the node you could zip these iterators together (see https://stackoverflow.com/a/243902/13341083).
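
As a minimal, framework-free sketch of what such a node could do (combine_generators is a hypothetical helper; plain lists stand in here for the Keras directory iterators, which yield batches the same way):

```python
def combine_generators(*iterables):
    """Round-robin over the given iterables, yielding one item (batch)
    from each in turn until all are exhausted."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)

# Plain lists stand in for the per-class Keras iterators:
train_dogs = ["dog_batch_1", "dog_batch_2"]
train_cats = ["cat_batch_1"]
print(list(combine_generators(train_dogs, train_cats)))
# -> ['dog_batch_1', 'cat_batch_1', 'dog_batch_2']
```

Inside a real node you would accept train_cats and train_dogs as inputs from the catalog and return the combined iterator for the training step.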

If you end up pursuing this approach, please raise a PR and contribute :) Best of luck

Answer from Shubham Agrawal

How about setting the path in parameters.yml and then reading it as an input to your ImageDataGenerator? It could look something like:

train_dogs_location: data/train/dogs/

Adapt the above to whatever layout suits you best. You can also consider setting a global path shared by all datasets in the conf/base/globals.yml file, for example for your root data folder.
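
As a sketch, assuming your project is set up with Kedro's TemplatedConfigLoader (which resolves ${...} placeholders against conf/base/globals.yml), that could look like:

```yaml
# conf/base/globals.yml
data_root: data

# conf/base/parameters.yml
train_dogs_location: ${data_root}/train/dogs/
validation_dogs_location: ${data_root}/validation/dogs/
```

Changing data_root then repoints every location in one place.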