I'm working on a binary classification problem with a large image dataset. I have annotations with the face bounding boxes in a JSON file which contains, for each sample, a reference to the image file and the corresponding annotation, like this:
{
  "samples": [
    {
      "imageUrl": "http://s3.amazonaws.com/my-bucket/prefix/to/the/image-file.jpeg",
      "annotation": [
        {
          "points": [
            { "x": 0.3020833333333333, "y": 0.1921875 },
            { "x": 0.73125, "y": 0.1921875 },
            { "x": 0.73125, "y": 0.73125 },
            { "x": 0.3020833333333333, "y": 0.73125 },
            { "x": 0.3020833333333333, "y": 0.1921875 }
          ]
        }
      ]
    }
  ]
}
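For context, this is roughly how I flatten the JSON into the dataframe I pass to flow_from_dataframe (simplified sketch; the file name "annotations.json" and the column names are just what I chose, and the polygon points are reduced to a single relative box):

```python
import json
import pandas as pd

# Flatten the annotation JSON into one row per sample, keeping the imageUrl
# and the bounding box as (x_min, y_min, x_max, y_max) in relative coordinates.
with open("annotations.json") as f:
    data = json.load(f)

rows = []
for sample in data["samples"]:
    points = sample["annotation"][0]["points"]
    xs = [p["x"] for p in points]
    ys = [p["y"] for p in points]
    rows.append({
        "imageUrl": sample["imageUrl"],
        "x_min": min(xs),
        "y_min": min(ys),
        "x_max": max(xs),
        "y_max": max(ys),
    })

df = pd.DataFrame(rows)
```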
I want to preprocess my dataset using these annotations to crop the faces in each image, because running bounding-box prediction on the fly is too slow. Currently I'm trying to use the flow_from_dataframe() method of ImageDataGenerator to load the images based on the references in my dataframe rows and preprocess them, but I have a few problems:
- My dataset is too large to fit in memory. I have about 0.5 million images, so I need a generator.
- The preprocessing_function (ImageDataGenerator argument) only receives the image data, so I can't look up the bounding-box points for the current image, because I don't have its imageUrl (see the sketch after this list).
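This is roughly my current attempt (simplified: I assume here that the images have already been downloaded locally into a filepath column and that label holds the binary target). The problem is that crop_face only ever receives the decoded image array, with no reference back to the row it came from:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def crop_face(img):
    # img is just the decoded image array; there is no way to tell which
    # dataframe row (and therefore which bounding box) it belongs to.
    return img

datagen = ImageDataGenerator(rescale=1.0 / 255, preprocessing_function=crop_face)

train_gen = datagen.flow_from_dataframe(
    df,
    x_col="filepath",
    y_col="label",
    target_size=(224, 224),
    class_mode="binary",
    batch_size=32,
)
```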
I tried to create my own custom image generator (a simplified sketch of my attempt is below), but I still need some of the ImageDataGenerator features. Is there any way to take advantage of the ImageDataGenerator class instead of writing my own generator from scratch?
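For reference, this is the rough shape of the custom generator I started, based on keras.utils.Sequence (simplified sketch; column names match the dataframe above), but with it I lose the ImageDataGenerator features I mentioned:

```python
import numpy as np
from PIL import Image
from tensorflow.keras.utils import Sequence

class FaceCropSequence(Sequence):
    """Loads a batch of rows, crops each face from its image using the
    relative bounding box, resizes, and returns (images, labels)."""

    def __init__(self, df, batch_size=32, target_size=(224, 224)):
        self.df = df.reset_index(drop=True)
        self.batch_size = batch_size
        self.target_size = target_size

    def __len__(self):
        return int(np.ceil(len(self.df) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, labels = [], []
        for _, row in batch.iterrows():
            img = Image.open(row["filepath"]).convert("RGB")
            w, h = img.size
            # The annotation points are relative, so scale them to pixels.
            box = (int(row["x_min"] * w), int(row["y_min"] * h),
                   int(row["x_max"] * w), int(row["y_max"] * h))
            face = img.crop(box).resize(self.target_size)
            images.append(np.asarray(face, dtype="float32") / 255.0)
            labels.append(row["label"])
        return np.array(images), np.array(labels)
```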