I have developed v1 of a (2+1)D ResNet that takes per-frame pixel data as input and predicts the bounding box coordinates of up to 8 objects in that video. The shape of my current input is:
(batch_size, n_frames, height, width, channels)
And my output is of shape:
(n_frames, 32)
I am using Intersection over Union (IoU) as the loss and am seeing relatively poor results. I thought I could improve them by increasing the number of features fed to the model (the dataset is quite small, but it will grow in the future). The features I have extracted from my videos are:
- edges
- motion vectors
- color histograms
- optical flow
- textures
How do I utilise these features to get better predictions from my model?
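For reference, the IoU loss I'm using works along these lines (a minimal sketch, not my exact code; `iou_loss` is an illustrative name and boxes are assumed to be in `(x_min, y_min, x_max, y_max)` format):

```python
import tensorflow as tf

def iou_loss(y_true, y_pred):
    """1 - mean IoU over boxes in (x_min, y_min, x_max, y_max) format.

    y_true / y_pred: float tensors of shape (..., 4).
    """
    # Intersection rectangle corners
    x1 = tf.maximum(y_true[..., 0], y_pred[..., 0])
    y1 = tf.maximum(y_true[..., 1], y_pred[..., 1])
    x2 = tf.minimum(y_true[..., 2], y_pred[..., 2])
    y2 = tf.minimum(y_true[..., 3], y_pred[..., 3])
    # Clamp to zero so disjoint boxes contribute no intersection
    inter = tf.maximum(x2 - x1, 0.0) * tf.maximum(y2 - y1, 0.0)
    area_true = (y_true[..., 2] - y_true[..., 0]) * (y_true[..., 3] - y_true[..., 1])
    area_pred = (y_pred[..., 2] - y_pred[..., 0]) * (y_pred[..., 3] - y_pred[..., 1])
    union = area_true + area_pred - inter
    iou = inter / tf.maximum(union, 1e-7)
    return 1.0 - tf.reduce_mean(iou)
```

Identical boxes give a loss of 0 and fully disjoint boxes a loss of 1.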
My first step was to get my pixel data, a single feature, and the labels into a list. Then I created training, validation and test splits, which were turned into datasets using a frame generator class.
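A minimal sketch of how a two-stream `tf.data` pipeline like this can be set up (random arrays stand in for real videos; `frame_generator` and all shapes here are illustrative, not my actual generator class — note that Keras multi-input models expect each dataset element as `((frames, edges), labels)`):

```python
import numpy as np
import tensorflow as tf

HEIGHT, WIDTH, N_FRAMES = 224, 224, 8

def frame_generator(n_videos=4):
    """Hypothetical stand-in for a frame generator class: yields one
    video's RGB frames, edge maps, and per-frame box labels."""
    for _ in range(n_videos):
        frames = np.random.rand(N_FRAMES, HEIGHT, WIDTH, 3).astype("float32")
        edges = np.random.rand(N_FRAMES, HEIGHT, WIDTH, 1).astype("float32")
        labels = np.random.rand(N_FRAMES, 32).astype("float32")
        # Inputs grouped into a tuple, labels separate
        yield (frames, edges), labels

train_ds = tf.data.Dataset.from_generator(
    frame_generator,
    output_signature=(
        (
            tf.TensorSpec(shape=(None, HEIGHT, WIDTH, 3), dtype=tf.float32),
            tf.TensorSpec(shape=(None, HEIGHT, WIDTH, 1), dtype=tf.float32),
        ),
        tf.TensorSpec(shape=(None, 32), dtype=tf.float32),
    ),
).batch(2)
```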
I then created the following architecture:
```python
input_shape = (None, None, HEIGHT, WIDTH, 4)
frames_input = layers.Input(shape=(None, HEIGHT, WIDTH, 3))
edges_input = layers.Input(shape=(None, HEIGHT, WIDTH, 1))
merged_input = layers.concatenate([frames_input, edges_input], axis=-1)

# Reshape input tensor to include time dimension of varying length
x = layers.Reshape((-1, HEIGHT, WIDTH, 4))(merged_input)

x = Conv2Plus1D(filters=FILTERS, kernel_size=KERNAL_SIZE, padding='same')(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = ResizeVideo(HEIGHT // 2, WIDTH // 2)(x)

# Block 1
x = add_residual_block(x, 16, (3, 3, 3))
x = ResizeVideo(HEIGHT // 4, WIDTH // 4)(x)

# Block 2
x = add_residual_block(x, 32, (3, 3, 3))
x = ResizeVideo(HEIGHT // 8, WIDTH // 8)(x)

# Block 3
x = add_residual_block(x, 64, (3, 3, 3))
x = ResizeVideo(HEIGHT // 16, WIDTH // 16)(x)

# Block 4
x = add_residual_block(x, 128, (3, 3, 3))

# Apply TimeDistributed dense layer to output bounding box coordinates for each frame
x = TimeDistributed(layers.GlobalAveragePooling2D())(x)  # Convert spatial dimensions to single dimension
x = TimeDistributed(layers.Dense(32))(x)

BoundingBoxV2_model = keras.Model([frames_input, edges_input], x)
```
And built the model like so:
```python
sampled_frames, sampled_edges, sampled_labels = next(iter(train_ds))
BoundingBoxV2_model.build([sampled_frames, sampled_edges])
```
When I try to fit my model I get the following error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[149], line 1
----> 1 history = BoundingBoxV2_model.fit(x = train_ds,
      2                                   epochs = EPOCHS,
      3                                   validation_data = val_ds)

File c:\Users\Rpiku\miniconda3\envs\rally_stream\lib\site-packages\keras\utils\traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File ~\AppData\Local\Temp\__autograph_generated_file3rk3lb1s.py:15, in outer_factory.<locals>.inner_factory.<locals>.tf__train_function(iterator)
     13 try:
     14     do_return = True
---> 15     retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
     16 except:
     17     do_return = False

ValueError: in user code:

    File "c:\Users\Rpiku\miniconda3\envs\rally_stream\lib\site-packages\keras\engine\training.py", line 1160, in train_function  *
    ...
    File "c:\Users\Rpiku\miniconda3\envs\rally_stream\lib\site-packages\keras\engine\input_spec.py", line 216, in assert_input_compatibility
        raise ValueError(

    ValueError: Layer "model_8" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None, None, 224, 224, 3) dtype=float32>]
```