I'm working on a machine learning process to classify images. My problem is that my dataset is imbalanced: of my 5 image categories, one class has about 400 images and each of the other classes has about 20.
I would like to balance my training set by applying data augmentation only to certain classes.
Here's the code I'm using to create the training and validation sets:
import pathlib
import tensorflow as tf

# Import data
data_dir = pathlib.Path(r"C:\Train set")

# Define train and validation sets (80% - 20%)
batch_size = 32
img_height = 240
img_width = 240

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
And here's how I apply data augmentation, although this would be for the entire train set:
from tensorflow import keras
from tensorflow.keras import layers

# Apply data augmentation
data_augmentation = keras.Sequential(
    [
        layers.experimental.preprocessing.RandomFlip(
            "horizontal",
            input_shape=(img_height, img_width, 3)),
        layers.experimental.preprocessing.RandomRotation(0.1),
        layers.experimental.preprocessing.RandomZoom(0.1),
    ]
)
Is there any way to go into my train set, extract those categories that have fewer images, and apply data augmentation only to them?
Thanks in advance!
I suggest not using ImageDataGenerator but a customized tf.data.Dataset. In a mapping operation, you can treat categories differently. Let me demonstrate it. Let's make you a folder with training images:
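Any images will do for the demonstration. As a rough sketch, the snippet below builds a throwaway folder with one sub-folder per category using only the standard library. The `write_png` helper, the category names other than astronaut, and the solid-colour 8x8 images are placeholders I made up for illustration; real photos would normally go here instead.

```python
import pathlib
import struct
import tempfile
import zlib


def write_png(path, rows):
    """Write rows of (r, g, b) tuples (values 0-255) as an 8-bit RGB PNG file."""
    height, width = len(rows), len(rows[0])

    def chunk(tag, data):
        # A PNG chunk is: data length, tag, data, CRC computed over tag + data.
        body = tag + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    # Every scanline starts with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(v for px in row for v in px) for row in rows)
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), default compression/filter/interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    png = (b"\x89PNG\r\n\x1a\n"
           + chunk(b"IHDR", header)
           + chunk(b"IDAT", zlib.compress(raw))
           + chunk(b"IEND", b""))
    pathlib.Path(path).write_bytes(png)


# Hypothetical layout: <root>/<category>/<file>.png
root = pathlib.Path(tempfile.mkdtemp(prefix="train_"))
colours = {"astronaut": (200, 30, 30), "cat": (30, 200, 30), "coffee": (30, 30, 200)}

for category, colour in colours.items():
    (root / category).mkdir()
    for i in range(2):
        write_png(root / category / f"{category}_{i}.png",
                  [[colour] * 8 for _ in range(8)])

# Relative paths of everything that was written, e.g. "astronaut/astronaut_0.png".
files = sorted(p.relative_to(root).as_posix() for p in root.glob("*/*.png"))
```

The only part that matters for what follows is the layout: the category can be read from the name of the parent folder.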
Now you have these files:
Let's apply transformations only to the astronaut category. Let's use the tf.image transformations.
Then, we make the tf.data.Dataset.
And when you iterate the dataset, you'll see that only the astronaut is flipped.
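The code for these steps is not shown here, so the following is only a minimal sketch of how they could fit together, not necessarily the original snippets. It assumes a `<root>/<category>/<file>.png` layout, writes two tiny synthetic images so it runs on its own, and passes the category to `preprocess` explicitly instead of parsing it from the path. The names `root`, `paths`, `categories` and `top_rows` are mine.

```python
import pathlib
import tempfile

import tensorflow as tf

# Hypothetical data: one 4x4 image per category, white top half and black bottom half,
# so that a vertical flip is easy to detect.
root = pathlib.Path(tempfile.mkdtemp())
pixels = tf.concat([tf.fill([2, 4, 3], 255), tf.zeros([2, 4, 3], tf.int32)], axis=0)
for category in ("astronaut", "cat"):
    (root / category).mkdir()
    tf.io.write_file(str(root / category / "img.png"),
                     tf.io.encode_png(tf.cast(pixels, tf.uint8)))

paths = sorted(str(p) for p in root.glob("*/*.png"))
categories = [pathlib.Path(p).parent.name for p in paths]


def preprocess(path, category):
    decoded = tf.io.decode_png(tf.io.read_file(path), channels=3)
    # Transform only the chosen category; every other class passes through unchanged.
    image = tf.cond(tf.equal(category, "astronaut"),
                    lambda: tf.image.flip_up_down(decoded),
                    lambda: decoded)
    return image, category


ds = tf.data.Dataset.from_tensor_slices((paths, categories)).map(preprocess)

# Mean value of the top pixel row per category: it changes only where the flip was applied.
top_rows = {}
for image, category in ds:
    top_rows[category.numpy().decode()] = float(tf.reduce_mean(tf.cast(image[0], tf.float32)))
print(top_rows)
```

`tf.cond` is used rather than a plain Python `if` so the branch behaves the same in eager mode and inside the graph that `Dataset.map` traces.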
Please note, for training you will need to uncomment the two lines in preprocess so it returns an array of floats and an integer.
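For reference, a training-ready variant might look roughly like the sketch below: it returns a float32 image and an integer label, and applies random tf.image augmentations only to the small classes. Everything here is an assumption on my part: class 0 is taken to be the large class, `MINORITY_CLASSES` and `augment_minority` are made-up names, and random in-memory tensors stand in for the real images.

```python
import tensorflow as tf

# Assumption: 5 classes, class 0 is the large one, classes 1-4 are the small ones.
MINORITY_CLASSES = tf.constant([1, 2, 3, 4], dtype=tf.int32)


def augment_minority(image, label):
    # uint8 in [0, 255] -> float32 in [0, 1]
    scaled = tf.image.convert_image_dtype(image, tf.float32)
    is_minority = tf.reduce_any(tf.equal(label, MINORITY_CLASSES))

    def augment():
        flipped = tf.image.random_flip_left_right(scaled)
        brightened = tf.image.random_brightness(flipped, max_delta=0.1)
        # Brightness shifts can leave [0, 1], so clip back into range.
        return tf.clip_by_value(brightened, 0.0, 1.0)

    # Only minority-class images go through the random augmentations.
    return tf.cond(is_minority, augment, lambda: scaled), label


# Stand-in data: six random 8x8 RGB images with integer labels.
images = tf.cast(tf.random.uniform([6, 8, 8, 3], maxval=256, dtype=tf.int32), tf.uint8)
labels = tf.constant([0, 0, 1, 2, 3, 4], dtype=tf.int32)

train_ds = (tf.data.Dataset.from_tensor_slices((images, labels))
            .map(augment_minority)
            .batch(3))

for batch_images, batch_labels in train_ds:
    print(batch_images.shape, batch_images.dtype, batch_labels.numpy())
```

Because the random ops run inside `map`, each epoch sees a differently augmented version of the minority images, while the large class is left untouched.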