Neural Network with Evolution Strategies optimizer keeps outputting the same accuracy on MNIST - Pytorch


My task is to create an ANN with an Evolution Strategies (ES) algorithm as the optimizer (no derivatives). The dataset I am using is MNIST. For now, I am just trying to implement this with a linear ANN.

I found a Colab notebook that does this exact thing, but on sklearn's "make_moons" dataset. I tried to incorporate what was in the notebook, and the code runs with no problems, yet it keeps outputting the same accuracy. Usually the first few outputs are different, then it "converges" at 0.0987 on the training set and 0.098 on the test set. Additionally, it takes very long to train. Maybe there are redundant iterations?

Colab Notebook, if you want to check it out: https://colab.research.google.com/drive/1SY38Evy4U9HfUDkofPZ2pLQzEnwvYC81?usp=sharing

I tried some Stack Overflow recommendations, such as adjusting the hyperparameters (learning rate, hidden units) and using Leaky ReLU in case of a "dying ReLU"; none of them worked. This leads me to believe that the problem is in the ES optimizer.

I am new to PyTorch, so if there are any glaring bad practices, please say so!

# imports
import torch
import torch.nn as nn
from tqdm.notebook import tqdm
import numpy as np
from sklearn.model_selection import train_test_split
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x = np.concatenate((x_train, x_test))
y = np.concatenate((y_train, y_test))

train_size = 0.7
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=train_size)
X_train, X_test, y_train, y_test = torch.FloatTensor(X_train), torch.FloatTensor(X_test), torch.LongTensor(y_train), torch.LongTensor(y_test)

X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

def weights_init(m):
  classname = m.__class__.__name__
  if classname.find('Linear') != -1:
    m.weight.data.normal_(0.0, 0.00)

model = nn.Sequential(
        nn.Linear(784, 200),
        nn.ReLU(),
        nn.Linear(200, 50),
        nn.ReLU(),
        nn.Linear(50, 10),
        nn.ReLU(),
    )

model = model.float()
model.apply(weights_init)

mother_parameters = model.parameters()
mother_vector = nn.utils.parameters_to_vector(mother_parameters)

# Now, for the hyperparameters
SIGMA = 0.1
LR = 0.01
POPULATION_SIZE=50
ITERATIONS = 100

# Fitness function
loss_func = nn.CrossEntropyLoss()
def loss(y_pred, y_true):
  return 1/loss_func(y_pred, y_true) # ES maximizes fitness, so take the reciprocal of the loss
  # now, a higher fitness value means the model is doing better

def fitness_func(solution):
  # solution is a vector of parameters like mother_parameters
  nn.utils.vector_to_parameters(solution, model.parameters())
  return loss(model(X_train), y_train) + 0.00000001

# in ES, our population is a slightly altered version of the mother parameters, so we implement a jitter function
def jitter(mother_params, state_dict):
  params_try = mother_params + SIGMA*state_dict
  return params_try

# now, we calculate the fitness of the entire population
def calculate_population_fitness(pop, mother_vector):
  fitness = torch.zeros(pop.shape[0])
  for i, params in enumerate(pop):
    p_try = jitter(mother_vector, params)
    fitness[i] = fitness_func(p_try)
  return fitness

def test(mother_params):
  nn.utils.vector_to_parameters(mother_params, model.parameters())
  return (((torch.max(model(X_test), 1)[1] == y_test).sum())/len(y_test)).item()
  # calculates the test accuracy of the model

n_params = nn.utils.parameters_to_vector(model.parameters()).shape[0]
print(f"Number of params: {n_params}")

# Now, we train the model
with torch.no_grad(): # autograd makes it slower and uses more memory; we don't use differentiation in ES
  for iteration in tqdm(range(ITERATIONS)):
    pop = torch.from_numpy(np.random.randn(POPULATION_SIZE, n_params)).float()
    fitness = calculate_population_fitness(pop, mother_vector)
    # normalize the fitness
    normalized_fitness = (fitness - torch.mean(fitness)) / torch.std(fitness)
    # update mother vector with the fitness values
    print(fitness.mean(), fitness.std())
    mother_vector = mother_vector + (LR / (POPULATION_SIZE * SIGMA)) * torch.matmul(pop.t(), normalized_fitness)
    reward = fitness_func(mother_vector)
    acc = test(mother_vector)
    print(f"Iteration: {iteration}, Reward:{reward:.3f}, Accuracy: {acc:.3f}")
1 Answer

Answered by maxy:

The most obvious problem is that you only evaluate your model once (in the line scores = model(data)) before you start to loop over the population.

You need to update and evaluate the model for every perturbation of the "mother" vector.

Or in other words, the function calculate_population_fitness(pop, mother_vector, scores, targets) creates a result that depends only on scores and targets, both of which are constant within your loop over the population.
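In code, the pattern maxy is describing looks roughly like this. It is only a sketch that reuses the question's own names (model, X_train, y_train, SIGMA, loss_func); the names eps, candidate and scores are just for illustration. The key point is that vector_to_parameters and the forward pass sit inside the loop over the population, so every perturbed candidate gets its own evaluation:

def calculate_population_fitness(pop, mother_vector):
  fitness = torch.zeros(pop.shape[0])
  for i, eps in enumerate(pop):
    candidate = mother_vector + SIGMA * eps                        # perturb the mother vector
    nn.utils.vector_to_parameters(candidate, model.parameters())   # load THIS candidate into the model
    scores = model(X_train)                                        # forward pass for THIS candidate
    fitness[i] = 1.0 / loss_func(scores, y_train)                  # reciprocal loss as fitness
  return fitness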

I suggest taking this a bit slower. First, try to write the code such that it evaluates a population only a single time (a single generation), without updating anything. Print the std/mean of everything, especially of the fitness. If the fitness values are all the same, try increasing your sigma until you see different fitness values. If that doesn't work, debug your program further before adding the update rule.
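For example, a single-generation check along these lines (a sketch reusing POPULATION_SIZE, n_params, mother_vector and calculate_population_fitness from the question; the exact prints and the suggested sigma values are only assumptions) shows immediately whether the fitness values differ at all:

# One generation, no update: sample a population and just inspect the fitness spread.
with torch.no_grad():
  pop = torch.randn(POPULATION_SIZE, n_params)
  fitness = calculate_population_fitness(pop, mother_vector)
  print(f"fitness mean: {fitness.mean():.6f}  std: {fitness.std():.6f}")
  print(f"fitness min:  {fitness.min():.6f}  max: {fitness.max():.6f}")
  # If the std is (close to) zero, every candidate looks identical to the fitness
  # function: try a larger SIGMA (e.g. 0.5 or 1.0) and rerun before adding the update rule.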

Better Initialization

This section is a bit more advanced; don't worry about it until you have got some learning going.

Good initialization is extremely important, especially as neural nets get deeper. For an ES, initialization is somewhat similar to sampling the population of the first generation. (In fact, an ES may also simply initialize the mother_vector to zero; that may be worth trying.) The changes to each parameter must be in the right range, and you may need to scale them depending on the parameter, so that a single step size sigma makes sense for all of them. What is right depends on whether you are changing a bias or a weight, which activation you use (tanh or ReLU), whether you use any kind of normalization, and how many inputs a neuron has (fan-in from the previous layer). So simply doing vector_to_parameters() may not be optimal, because it throws away that structural information.

This is something PyTorch will probably solve perfectly for you behind your back for the initial parameters. But since you are changing the parameters yourself, you need to know what a good step size is for each of them (for best results).
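To make the "right step size per parameter" idea concrete, here is a hedged sketch. It builds a per-parameter noise scale from each Linear layer's fan-in (a Kaiming-style sqrt(2/fan_in) for the ReLU weights, a small constant for the biases) and starts the mother vector at zero, as mentioned above. The specific scaling constants are assumptions for illustration, not something prescribed by the question or this answer:

import math

# Per-parameter noise scale, in the same order as nn.utils.parameters_to_vector:
# for each nn.Linear the weight chunk comes first, then the bias chunk.
sigma_chunks = []
for m in model.modules():
  if isinstance(m, nn.Linear):
    fan_in = m.in_features
    sigma_chunks.append(torch.full((m.weight.numel(),), math.sqrt(2.0 / fan_in)))  # Kaiming-style scale for ReLU weights
    sigma_chunks.append(torch.full((m.bias.numel(),), 0.01))                        # small constant scale for biases
sigma_vector = torch.cat(sigma_chunks)

mother_vector = torch.zeros(sigma_vector.shape[0])  # zero-initialized mother vector, as suggested above

def jitter(mother_params, eps):
  return mother_params + sigma_vector * eps  # element-wise scale instead of one global SIGMA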

As a baseline to compare to, I would suggest you just evaluate some 1000 randomly initialized models (initialized by PyTorch). If your ES is doing anything sensible, it should be able to beat the best result of that baseline.
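A rough sketch of that baseline, reusing the test() helper from the question and PyTorch's default initialization via nn.Linear.reset_parameters() (the count of 1000 comes from the suggestion above; the rest of the loop is an assumption about how you would wire it up):

import copy

# Baseline: best test accuracy over N freshly (default-)initialized, untrained models.
N_BASELINE = 1000
best_acc = 0.0
with torch.no_grad():
  for _ in range(N_BASELINE):
    baseline = copy.deepcopy(model)
    for m in baseline.modules():
      if isinstance(m, nn.Linear):
        m.reset_parameters()  # PyTorch's default Linear initialization
    acc = test(nn.utils.parameters_to_vector(baseline.parameters()))  # test() loads the vector into `model`
    best_acc = max(best_acc, acc)
print(f"Best accuracy over {N_BASELINE} random inits: {best_acc:.3f}")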