Torch Lua: Why is my gradient descent not optimizing the error?

I've been trying to implement a siamese neural network in Torch/Lua, as I already explained here. I now have a first implementation that I believe is correct.

Unfortunately, I'm facing a problem: during back-propagation training, gradient descent does not update the error. That is, it always computes the same value (+1 or -1) and never changes it. In a correct implementation, the error should go from +1 to -1 or from -1 to +1. In my case, it is just stuck at the upper value and nothing changes.

Why? I'm really looking for someone who could give me some hints.

Here's my code, which you can try to run:

LEARNING_RATE_CONST = 0.01;
output_layer_number = 1;
MAX_ITERATIONS_CONST = 10;

require 'os'
require 'sys' -- provides sys.sleep, used below
require 'nn'

-- rounds a real number num to the number having idp values after the dot
function round(num, idp)
  local mult = 10^(idp or 0)
  return math.floor(num * mult + 0.5) / mult
end

-- gradient update for the siamese neural network
function gradientUpdate(perceptron, dataset_vector, targetValue, learningRate, max_iterations)

  print('gradientUpdate()\n')

  for i = 1, max_iterations do

    predictionValue = perceptron:forward(dataset_vector)[1]
    sys.sleep(0.2)
    gradientWrtOutput = torch.Tensor({-targetValue});

    perceptron:zeroGradParameters()
    perceptron:backward(dataset_vector, gradientWrtOutput)

    perceptron:updateParameters(learningRate)

    predictionValue = perceptron:forward(dataset_vector)[1]
    io.write("i="..i..") optimization predictionValue= "..predictionValue.."\n");

    if(predictionValue==targetValue) then
      io.write("\t@@@ (i="..i..") optimization predictionValue "..predictionValue.." @@@\n");
      break
    end

  end
  return perceptron;
end

input_number = 6; -- they are 6
dim = 10
hiddenUnits = 3

trueTarget=1; falseTarget=-trueTarget; 

trainDataset = {}; targetDataset = {};
for i=1, dim do
     trainDataset[i]={torch.rand(input_number),  torch.rand(input_number)}
     if i%2==0 then targetDataset[i] = trueTarget
     else  targetDataset[i] = falseTarget 
     end
      -- print('targetDataset['..i..'] '..targetDataset[i]);
      -- sys.sleep(0.2)
end

for i=1, dim do
  for j=1, input_number do
     print(round(trainDataset[i][1][j],2)..' '..round(trainDataset[i][2][j],2));
  end
end

-- imagine we have one network we are interested in, it is called "perceptronUpper"
    perceptronUpper= nn.Sequential()
    print('input_number='..input_number..'\thiddenUnits='..hiddenUnits);
    perceptronUpper:add(nn.Linear(input_number, hiddenUnits))
    perceptronUpper:add(nn.Tanh())
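    -- note: dropOutFlag and TRUE are both undefined globals (nil), so this test is nil==nil and Dropout is always added
    if dropOutFlag==TRUE then perceptronUpper:add(nn.Dropout()) end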

    perceptronUpper:add(nn.Linear(hiddenUnits,output_layer_number))
    perceptronUpper:add(nn.Tanh())

    perceptronLower = perceptronUpper:clone('weight', 'gradWeight', 'gradBias', 'bias')

    parallel_table = nn.ParallelTable()
    parallel_table:add(perceptronUpper)
    parallel_table:add(perceptronLower)

    perceptron= nn.Sequential()
    perceptron:add(parallel_table)
    perceptron:add(nn.CosineDistance())

    max_iterations = MAX_ITERATIONS_CONST;
    learnRate = LEARNING_RATE_CONST;


    -- # TRAINING:
    for k=1, dim do
      print('\n[k='..k..'] gradientUpdate()');
      perceptron = gradientUpdate(perceptron, trainDataset[k], targetDataset[k], learnRate, max_iterations)
    end

The question is: why is the predictionValue variable always the same? Why doesn't it get updated?

EDIT: I have now realized that the problem was that I was using an output layer with only 1 dimension. I increased it to 6, but unfortunately I have a new problem: the gradient is not updating in the right direction. For example, here's what happens when I run my previous code with output_layer_number=6:

i=1) predictionValue=0.99026757478549 target=-1
i=2) predictionValue=0.9972249767451 target=-1
i=3) predictionValue=0.95910828489725 target=-1
i=4) predictionValue=0.98960431921481 target=-1
i=5) predictionValue=0.9607511165448 target=-1
i=6) predictionValue=0.7774414068913 target=-1
i=7) predictionValue=0.78994300446018 target=-1
i=8) predictionValue=0.96893163039218 target=-1
i=9) predictionValue=0.99786687264848 target=-1
i=10) predictionValue=0.92254348014872 target=-1
i=11) predictionValue=0.84935926907926 target=-1
i=12) predictionValue=0.93696147024616 target=-1
i=13) predictionValue=0.93469525917962 target=-1
i=14) predictionValue=0.9584800936415 target=-1
i=15) predictionValue=0.99376832219916 target=-1
i=16) predictionValue=0.97381161559835 target=-1
i=17) predictionValue=0.94124227912993 target=-1
i=18) predictionValue=0.94947181918451 target=-1
i=19) predictionValue=0.9946839455962 target=-1
i=20) predictionValue=0.9637013147803 target=-1
i=21) predictionValue=0.94853981221519 target=-1
i=22) predictionValue=0.95441294067747 target=-1
i=23) predictionValue=0.99999485148281 target=-1
i=24) predictionValue=0.9900480694373 target=-1
i=25) predictionValue=0.99316158138794 target=-1

That is, the predictionValue never goes towards -1. Why?

1 answer

deltheil (best answer)

Why is the predictionValue variable always the same? Why doesn't it get updated?

First of all, you should perform the backward propagation only if predictionValue*targetValue < 1, so that you back-propagate only when a pair still needs to be pushed together (targetValue = 1) or pulled apart (targetValue = -1).
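The guard described above can be sketched as follows. This is a minimal illustration, not the official torch/nn example: `guardedStep`, `makeGrad` and `stubModel` are hypothetical names introduced here, and the stub only stands in for a real network so that the control flow can be run without Torch. The model methods it calls (`forward`, `zeroGradParameters`, `backward`, `updateParameters`) are the same ones used in the question's code.

```lua
-- Hypothetical helper: one training step that back-propagates only while
-- prediction * target < 1. `model` is anything exposing the four methods
-- used in the question; `makeGrad` builds the gradient w.r.t. the output.
local function guardedStep(model, input, target, learningRate, makeGrad)
  local prediction = model:forward(input)[1]
  -- Skip the update when the pair is already where the target wants it.
  if prediction * target >= 1 then
    return false
  end
  model:zeroGradParameters()
  model:backward(input, makeGrad({-target}))
  model:updateParameters(learningRate)
  return true
end

-- Stand-in model (assumption, for illustration only): it always predicts
-- `output` and counts how many times its parameters were updated.
local function stubModel(output)
  local m = { updates = 0 }
  function m:forward() return { output } end
  function m:zeroGradParameters() end
  function m:backward() end
  function m:updateParameters() self.updates = self.updates + 1 end
  return m
end

local identity = function(t) return t end

-- 0.99 * -1 < 1, so the step is performed.
print(guardedStep(stubModel(0.99), {}, -1, 0.01, identity))
-- 1 * 1 is not < 1, so the step is skipped.
print(guardedStep(stubModel(1), {}, 1, 0.01, identity))
```

With Torch loaded, the same helper would presumably be called as `guardedStep(perceptron, trainDataset[k], targetDataset[k], learnRate, torch.Tensor)`, so that the gradient is built exactly as in the question.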

See also this torch/nn official example that illustrates this.

That being said, you have only 1 output unit (output_layer_number = 1). That means that each branch of your siamese network produces a single scalar, u and v respectively. This pair of scalars is then compared by the cosine distance:

C(u, v) = cosine(u, v) = (u / |u|) × (v / |v|) = sign(u) × sign(v)

Note: this criterion can only take two values here: 1 or -1 (see below in blue).

When it is time to back-propagate, you compute the derivatives of this criterion with respect to the inputs, i.e. dC/du and dC/dv. But these derivatives are zero everywhere except at 0, where they are undefined (see below in red):

[Figure: the criterion (in blue) and its derivatives (in red)]

This is why the back-propagation does nothing here, i.e. it remains static (and you can verify this in practice by printing out the norms of these derivatives).
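The argument can also be checked numerically without Torch. The sketch below is plain Lua and only mimics the formula given in this answer; `cosine1d` and `dCdu` are hypothetical helper names, not functions from nn, and nn.CosineDistance itself may differ in numerical details.

```lua
-- Cosine similarity of two 1-dimensional "vectors", i.e. two non-zero scalars.
local function cosine1d(u, v)
  return (u * v) / (math.abs(u) * math.abs(v))
end

-- Central finite-difference estimate of dC/du at (u, v).
local function dCdu(u, v, eps)
  eps = eps or 1e-6
  return (cosine1d(u + eps, v) - cosine1d(u - eps, v)) / (2 * eps)
end

-- Same sign: the criterion is 1, whatever the magnitudes are.
print(cosine1d(0.3, 0.8), cosine1d(5, 0.001))
-- Opposite signs: the criterion is -1.
print(cosine1d(-0.3, 0.8))
-- Away from 0 the criterion is flat, so the estimated derivative is 0.
print(dCdu(0.3, 0.8), dCdu(-0.3, 0.8))
```

Because the slope is 0 on both sides of the origin, a gradient step has nothing to follow, which matches the static behaviour reported in the question when output_layer_number = 1.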