Let's say I have a dataset comprising greyscale videos. The length and spatial size of each video can vary, so I am representing the data with three variable dimensions (time, y, x) plus a channel dimension, giving the following shape:
| Batch size | time | y | x | channels |
|---|---|---|---|---|
| None | None | None | None | 1 |
I want to extract features (say 16 of them) from the temporal dimension while keeping the same spatial dimensions, which would give me the following output shape:
| Batch size | y | x | filters |
|---|---|---|---|
| None | None | None | 16 |
Notably, the shape of the data has been reduced by one dimension. In my head, I should be able to accomplish this with a Conv3D layer (feature generation) followed by some aggregating operation over the time dimension only (global max/average pooling, or some linear operator). The resulting shape would be (None, 1, None, None, 16), which I believe I could reduce to (None, None, None, 16) using this answer.
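To make the shapes concrete, here is a minimal sketch of the pipeline I have in mind (the pooling step is exactly the part I cannot express; the squeeze stands in for the reshape from the linked answer):

```python
import tensorflow as tf
from tensorflow.keras import layers

# (time, y, x, channels): all three variable dimensions are None
inputs = layers.Input(shape=(None, None, None, 1))

# Feature generation: (None, None, None, None, 1) -> (None, None, None, None, 16)
feats = layers.Conv3D(16, kernel_size=3, padding='same')(inputs)

# Missing step: aggregate over the entire (unknown-length) time axis,
# giving (None, 1, None, None, 16):
# pooled = ???

# ...then drop the singleton time axis to reach (None, None, None, 16):
# out = layers.Lambda(lambda t: tf.squeeze(t, axis=1))(pooled)
```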
My problem is that I cannot figure out how to apply any of Keras's global operators along a single dimension. Since the size of the time dimension is unknown, I cannot specify a pool size of (unknown, 1, 1) for a MaxPooling3D or convolutional layer that would span the entire time dimension. On the other hand, the GlobalMaxPooling3D layer doesn't accept an argument specifying which dimensions to operate on.
Do I have to implement some complicated custom layer for this, or does a solution already exist? I've looked at combining a Reshape layer with MaxPooling1D, but I run into the same problem: I don't know the sizes of the x and y dimensions needed to reassemble the spatial structure after the pooling operation.
I believe you could do something like the following if you don't want to use a custom layer:
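For instance, with tf.keras you can wrap a reduction over the time axis in a Lambda layer. This is a minimal sketch (the Conv3D hyperparameters here are illustrative, not prescribed):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input shape (time, y, x, channels); all but channels are unknown
inputs = layers.Input(shape=(None, None, None, 1))

# Extract 16 features; padding='same' keeps the spatial dimensions
x = layers.Conv3D(16, kernel_size=3, padding='same', activation='relu')(inputs)

# Average over the time axis (axis=1), collapsing
# (batch, time, y, x, 16) -> (batch, y, x, 16) in one step
x = layers.Lambda(lambda t: tf.reduce_mean(t, axis=1))(x)

model = models.Model(inputs, x)
model.summary()  # final output shape: (None, None, None, 16)
```

Because tf.reduce_mean handles a dynamic axis size at run time, this works even though the length of the time dimension is None.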
Alternatively, using a custom layer to do global average pooling across time:
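A sketch of such a layer (the class name GlobalAveragePoolingAlongAxis is my own, not a Keras built-in):

```python
import tensorflow as tf
from tensorflow.keras import layers

class GlobalAveragePoolingAlongAxis(layers.Layer):
    """Averages over one axis of the input, even if its size is unknown."""

    def __init__(self, axis=1, **kwargs):
        super().__init__(**kwargs)
        self.axis = axis

    def call(self, inputs):
        # tf.reduce_mean handles a dynamic (None-sized) axis at run time
        return tf.reduce_mean(inputs, axis=self.axis)

    def compute_output_shape(self, input_shape):
        # Drop the pooled axis: (batch, time, y, x, c) -> (batch, y, x, c)
        shape = list(input_shape)
        del shape[self.axis]
        return tuple(shape)

    def get_config(self):
        # Lets the layer be saved and restored along with the model
        config = super().get_config()
        config.update({"axis": self.axis})
        return config

# Usage: x = GlobalAveragePoolingAlongAxis(axis=1)(conv3d_output)
```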