Scan function from Theano replicates non_sequences shared variables


I'm trying to implement a custom convolutional layer for a CNN in Theano, and to do so I'm using the scan function. The idea is to apply the new convolution mask to each pixel.

The scan function compiles correctly, but for some reason I get an out-of-memory error. The debug output (see below) indicates that the non_sequences variables are replicated for each iteration of the loop (one copy per pixel), which of course exhausts my GPU memory:

def convolve_location(index, input, bias):
    # Half-size of the convolution mask, used to clamp the window below
    hsize = self.W.shape / 2
    # Clamp the window around the current pixel to the image borders
    t = T.switch(index[0]-hsize[0] < 0, 0, index[0]-hsize[0])
    l = T.switch(index[1]-hsize[1] < 0, 0, index[1]-hsize[1])
    b = T.switch(index[0]+hsize[0] >= input.shape[2], input.shape[2]-1, index[0]+hsize[0])
    r = T.switch(index[1]+hsize[1] >= input.shape[3], input.shape[3]-1, index[1]+hsize[1])

    # Squared differences between the clamped window and the centre pixel
    r_image = (input[:, :, t:b, l:r] - input[:, :, index[0], index[1]][:, :, None, None]) ** 2
    r_delta = (bias[:, :, t:b, l:r] - bias[:, :, index[0], index[1]][:, :, None, None]) ** 2
    return T.sum(r_image*r_delta)

# Define cost function over all pixels
self.inds = theano.shared(np.array([(i, j) for i in range(self.image_shape[2]) for j in range(self.image_shape[3])], dtype='int32'), borrow=True)
self.cost = T.sum(theano.scan(
    fn=convolve_location,
    outputs_info=None,
    sequences=[self.inds],
    non_sequences=[self.input, self.b],
    n_steps=np.prod(self.image_shape[-2:])
)[0])

Here's the output from the debugger:

MemoryError: alloc failed
Apply node that caused the error: Alloc(TensorConstant{0.0}, TensorConstant{1025}, TensorConstant{2000}, TensorConstant{3}, TensorConstant{32}, TensorConstant{32})
Inputs types: [TensorType(float32, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), (), (), (), ()]
Inputs strides: [(), (), (), (), (), ()]
Inputs values: [array(0.0, dtype=float32), array(1025), array(2000), array(3), array(32), array(32)]

Debugprint of the apply node:
 Alloc [@A] <TensorType(float32, 5D)> ''
  |TensorConstant{0.0} [@B] <TensorType(float32, scalar)>
  |TensorConstant{1025} [@C] <TensorType(int64, scalar)>
  |TensorConstant{2000} [@D] <TensorType(int64, scalar)>
  |TensorConstant{3} [@E] <TensorType(int64, scalar)>
  |TensorConstant{32} [@F] <TensorType(int64, scalar)>
  |TensorConstant{32} [@F] <TensorType(int64, scalar)>

Storage map footprint:
 - CudaNdarrayConstant{[[[[ 0.]]]]}, Shape: (1, 1, 1, 1), ElemSize: 4 Byte(s), TotalSize: 4 Byte(s)
 - Constant{18}, Shape: (1,), ElemSize: 8 Byte(s), TotalSize: 8.0 Byte(s)
 - TensorConstant{(1, 1) of 0}, Shape: (1, 1), ElemSize: 1 Byte(s), TotalSize: 1 Byte(s)
 - Constant{1024}, Shape: (1,), ElemSize: 8 Byte(s), TotalSize: 8.0 Byte(s)
 - Constant{-1}, Shape: (1,), ElemSize: 8 Byte(s), TotalSize: 8.0 Byte(s)
 - TensorConstant{32}, Shape: (1,), ElemSize: 8 Byte(s), TotalSize: 8.0 Byte(s)
 - Subtensor{:int64:}.0, Shape: (1024,), ElemSize: 4 Byte(s), TotalSize: 4096 Byte(s)
 - Constant{34}, Shape: (1,), ElemSize: 8 Byte(s), TotalSize: 8.0 Byte(s)
 - Constant{2}, Shape: (1,), ElemSize: 8 Byte(s), TotalSize: 8.0 Byte(s)
 - TensorConstant{[2000    3..  32   32]}, Shape: (4,), ElemSize: 8 Byte(s), TotalSize: 32 Byte(s)
 - Reshape{4}.0, Shape: (2000, 3, 32, 32), ElemSize: 4 Byte(s), TotalSize: 24576000 Byte(s)
 - TensorConstant{(1, 1, 1, 1) of 0}, Shape: (1, 1, 1, 1), ElemSize: 1 Byte(s), TotalSize: 1 Byte(s)
 - CudaNdarrayConstant{[[[[ 0.1]]]]}, Shape: (1, 1, 1, 1), ElemSize: 4 Byte(s), TotalSize: 4 Byte(s)
 - <TensorType(float32, matrix)>, Shape: (50000, 3072), ElemSize: 4 Byte(s), TotalSize: 614400000 Byte(s)

As you can see, the input is shown as a 1025x2000x3x32x32 tensor, while the original tensor has size 2000x3x32x32; the 1025 is the number of scan iterations + 1.

Why are the non_sequences variables replicated for each iteration instead of simply being reused, and how can I fix it?
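For reference, here is the same sequences/non_sequences pattern in a minimal, self-contained form (toy shapes; the names x, inds, and step are illustrative, not from my actual code), which can be used to reproduce or rule out the issue in isolation:

import numpy as np
import theano
import theano.tensor as T

# Toy shared data and index sequence (hypothetical shapes)
x = theano.shared(np.arange(12, dtype='float32').reshape(3, 4))
inds = theano.shared(np.arange(3, dtype='int32'))

def step(i, data):
    # One cost term computed from the shared non_sequence
    return T.sum(data[i] ** 2)

results, updates = theano.scan(fn=step,
                               sequences=[inds],
                               non_sequences=[x])
cost = T.sum(results)
f = theano.function([], cost, updates=updates)
print(f())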

EDIT:

Both self.input and self.b are shared variables. self.input is passed to the class on initialization, while self.b is defined inside the class as follows:

self.b = theano.shared(np.zeros(image_shape, dtype=theano.config.floatX), borrow=True)

1 Answer

Answer by Pascal Lamblin (accepted):

It is possible that, when the scan is first created or at some point during the optimization process, a symbolic Alloc with that shape is created. However, it should be optimized away at a later stage of compilation.

We are aware that there was a bug related to that recently, which should now be fixed in the development ("bleeding-edge") version of Theano. In fact, I just tried your snippet (slightly edited) with a recent development version and had no memory error. Moreover, there was no 5D tensor anywhere in the computation graph, which suggests the bug has indeed been fixed.
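To check whether an installation already includes the fix, a quick sanity check is to print the installed version (a minimal sketch; the exact version string will vary, and development installs typically report a dev suffix):

import theano
print(theano.__version__)  # development ("bleeding-edge") installs report a dev version string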

Finally, please be aware that operations such as convolutions, which are not really recurrent, will probably be much slower when expressed with scan than with one of the existing convolution operations. In particular, scan is not able to parallelize across iterations, even when they do not depend on each other.
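As a sketch of that advice (assuming a plain 2D convolution fits your use case; the custom per-pixel mask above would still need its own operation), the built-in op is used like this:

import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

x = T.tensor4('x')  # (batch, channels, rows, cols)
w = T.tensor4('w')  # (n_filters, channels, filter_rows, filter_cols)
y = conv2d(x, w)    # a single graph node; default border_mode='valid'
f = theano.function([x, w], y)

Unlike the scan version, this compiles to one node that the backend can parallelize on the GPU.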