I am working on training a Convolutional Neural Network (CNN) with satellite imagery data to estimate population density from RGB images. The satellite imagery is stored in a TIFF file, and I am using xarray to open the data. I am facing challenges in efficiently generating datasets for training.
I want the training images to differ from epoch to epoch, by sampling a random crop from each section of the xarray on every epoch. Currently, I have a function that samples a random point inside each polygon of the GeoDataFrame and builds an image around it. However, wrapping this function in tf.py_function has become a bottleneck during image generation: the GPU is constantly waiting for the data to be produced. I think the data is not being prefetched because of the use of tf.py_function.
This is a minimal version of the code I'm currently working with:
import xarray as xr
import geopandas as gpd
import numpy as np
import tensorflow as tf
from shapely.geometry import Point

img_size = 128

# Open the satellite image (the "rasterio" engine requires rioxarray) and the neighbourhood polygons
dataset = xr.open_dataset("sat_img.tif", engine="rasterio")
gdf = gpd.read_parquet("neighbours.parquet")
def find_nearest_idx(array, value):
    '''Returns the index of the value in the array nearest to {value}.'''
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx
def random_point_from_geometry(polygon):
    '''Samples a random point inside the polygon by rejection sampling over its bounding box.'''
    # Get the bounds of the polygon
    (minx, miny, maxx, maxy) = polygon.bounds
    # Loop until a random point falls inside the polygon
    while True:
        # Generate a random point within the bounds
        x = np.random.uniform(minx, maxx)
        y = np.random.uniform(miny, maxy)
        if polygon.contains(Point(x, y)):
            return x, y
def get_data(i):
    '''Generates the image and label for neighbour i of the GeoDataFrame.'''
    # i arrives as a scalar tensor inside tf.py_function
    i = int(i)

    # Get the polygon and the label of that index (neighbour)
    polygon = gdf.iloc[i]["geometry"]
    label = gdf.iloc[i]["label"]

    # Sample a random point from inside the polygon
    x, y = random_point_from_geometry(polygon)

    # Find the nearest raster cell to this random point
    idx_x = find_nearest_idx(dataset.x, x)
    idx_y = find_nearest_idx(dataset.y, y)

    # Build the index box of the image around that cell
    idx_x_min = round(idx_x - img_size / 2)
    idx_x_max = round(idx_x + img_size / 2)
    idx_y_min = round(idx_y - img_size / 2)
    idx_y_max = round(idx_y + img_size / 2)

    image_ds = dataset.isel(
        x=slice(idx_x_min, idx_x_max),
        y=slice(idx_y_min, idx_y_max),
    )
    image_da = image_ds.band_data
    image = image_da.to_numpy()
    return image, label
## The tensorflow section, based on:
# https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18

# Generator for the indexes (0, 1, 2, 3, ...), one per neighbour.
# Named tf_dataset so it does not shadow the xarray dataset used in get_data.
tf_dataset = tf.data.Dataset.from_generator(
    lambda: list(range(gdf.shape[0])),
    tf.uint8,
)

# The actual data generator: passes each index to the function that builds the image
tf_dataset = tf_dataset.map(
    lambda i: tf.py_function(
        func=get_data, inp=[i], Tout=[tf.uint8, tf.float32]
    ),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)
...
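After the map, the dataset goes through the usual batching and prefetching before training; roughly something like the sketch below (the batch size, epoch count and model are placeholders):

# Rough sketch of the rest of the pipeline; batch size, epochs and `model` are placeholders
tf_dataset = tf_dataset.batch(32).prefetch(tf.data.experimental.AUTOTUNE)
model.fit(tf_dataset, epochs=10)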
Any suggestions on how I can make this process more efficient, with tf.function or any other method? Currently the GPU spends most of its time waiting for the images to be generated...