Optimizing TensorFlow Image Generation for CNN Training with xarray Data


I am working on training a Convolutional Neural Network (CNN) with satellite imagery data to estimate population density from RGB images. The satellite imagery is stored in a TIFF file, and I am using xarray to open the data. I am facing challenges in efficiently generating datasets for training.

I want to create different images for each training epoch by sampling random images from each section of the xarray. Currently, I have a function that generates a random point from each polygon in the dataset and creates an image. However, using tf.py_function for this function has resulted in a bottleneck during image generation, as the GPU is constantly waiting for the data to be generated. I think the data is not being prefetched because of the usage of tf.py_function.

This is a minimal version of the code I'm currently working with:

import xarray as xr
import geopandas as gpd
import numpy as np
import tensorflow as tf
from shapely.geometry import Point

img_size = 128
dataset = xr.open_dataset("sat_img.tif")
gdf = gpd.read_file("neighbours.parquet")

def find_nearest_idx(array, value):
    ''' Returns the index of the array element nearest to value. '''
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx

def random_point_from_geometry(polygon, size=100):
    '''Generates a random point within the bounds of a Polygon.'''

    # Get bounds of the shapefile's polygon
    (minx, miny, maxx, maxy) = polygon.bounds

    # Loop until a random point falls inside the polygon (rejection sampling)
    while True:
        # Draw scalar coordinates within the bounding box
        x = np.random.uniform(minx, maxx)
        y = np.random.uniform(miny, maxy)
        point = Point(x, y)
        if polygon.contains(point):
            return x, y
 
def get_data(i):
    ''' Generates image and label for the neighbour i from the GeoDataFrame. '''
    # tf.py_function passes i as a tensor; pandas indexing needs a plain int
    i = int(i)
    # Get the polygon and the label of that index (neighbour)
    polygon = gdf.iloc[i]["geometry"]
    label = gdf.iloc[i]["label"]
     
    # Sample a random point inside the polygon
    x, y = random_point_from_geometry(
        polygon
    )
    
    # Find the nearest raster cell to this random point
    idx_x = find_nearest_idx(dataset.x, x)
    idx_y = find_nearest_idx(dataset.y, y)
    
    # Create the indexes of the box of the image
    idx_x_min = round(idx_x - img_size / 2)
    idx_x_max = round(idx_x + img_size / 2)
    idx_y_min = round(idx_y - img_size / 2)
    idx_y_max = round(idx_y + img_size / 2)
    
    image_ds = dataset.isel(
        x=slice(idx_x_min, idx_x_max),
        y=slice(idx_y_min, idx_y_max)
    )
    image_da = image_ds.band_data
    image = image_da.to_numpy()
    
    return image, label

## The tensorflow section, based on: 
#   https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18
# Generator for the index
# Named tf_dataset so it does not overwrite the xarray `dataset` used by get_data
tf_dataset = tf.data.Dataset.from_generator(
    lambda: range(gdf.shape[0]),  # The index generator
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int64),  # uint8 would overflow past 255 rows
)  # Creates a dataset with only the indexes (0, 1, 2, 3, etc.)

tf_dataset = tf_dataset.map(
    lambda i: tf.py_function(  # The actual data generator. Passes the index to the function that will process the data.
        func=get_data, inp=[i], Tout=[tf.uint8, tf.float32]
    ),
    num_parallel_calls=tf.data.AUTOTUNE,
)

...


Any suggestions on how I can make this process more efficient with tf.function or other methods? Currently the GPU spends most of its time waiting for the images to be generated.
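
One direction that might be worth testing is a NumPy-only sketch of the cropping step. It is untested on the real data and rests on two assumptions: the raster fits in memory as a single array (for example from `dataset.band_data.to_numpy()`, done once up front), and the x and y coordinates are evenly spaced, so the nearest index can be computed arithmetically instead of scanning the whole coordinate array for every sample. The names `nearest_index` and `crop_patch` are hypothetical, and the synthetic raster only stands in for the real file.

```python
import numpy as np

IMG_SIZE = 128  # same crop size as img_size in the question


def nearest_index(coords, value):
    """Index of the coordinate nearest to value, assuming evenly spaced coords."""
    step = coords[1] - coords[0]  # negative when the axis is descending
    idx = int(round(float((value - coords[0]) / step)))
    return min(max(idx, 0), len(coords) - 1)


def crop_patch(raster, xs, ys, x, y, size=IMG_SIZE):
    """Cut a (size, size, bands) patch centred near (x, y) from an in-memory array.

    raster is assumed to have shape (bands, rows, cols).
    """
    half = size // 2
    # Clamp the centre so the window never runs off the raster edge.
    col = min(max(nearest_index(xs, x), half), raster.shape[2] - half)
    row = min(max(nearest_index(ys, y), half), raster.shape[1] - half)
    patch = raster[:, row - half:row + half, col - half:col + half]
    return np.moveaxis(patch, 0, -1)  # channels-last layout


# Synthetic stand-in for the in-memory raster and its coordinate axes.
rng = np.random.default_rng(0)
raster = rng.integers(0, 255, size=(3, 512, 640)).astype(np.float32)
xs = np.linspace(10.0, 20.0, 640)  # ascending x coordinates
ys = np.linspace(50.0, 45.0, 512)  # descending y coordinates

patch = crop_patch(raster, xs, ys, x=10.0, y=47.5)
print(patch.shape)  # (128, 128, 3)
```

If this holds up, `crop_patch` could replace the `isel` call inside `get_data`, so each sample costs one array slice instead of an xarray selection. Ending the pipeline with `.prefetch(tf.data.AUTOTUNE)` is the usual way to let generation overlap with training, though whether that alone removes the stall here is something to measure.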
