C++ NN implementation only trains to uniform output value


I'm currently working on a neural network implementation in C++ as a personal exploration to begin learning about deep neural networks and to help familiarise myself with some newer (C++20 & C++23) features.

I'm seeing a problem with my implementation: it always seems to end up trained to output a flat (constant) value. My guess is that the issue stems from how I represent my weight matrices, how I calculate neuron outputs, or how I calculate gradients (or some combination of these).

Here's an example of the network's output generated after attempting to train it on the function f(x) = 5 * sin(x). The network has 3 hidden layers of 10 neurons and input & output layers of 1 neuron each. It only supports a fixed learning rate at the moment, in this case 0.015.

f(x) = 5 * sin(x)                       prediction      cost:
f(0) = 0                        0.9996808639861019      (cost = 0.9993618298199992)
f(0.3) = 1.4776010333066978                     0.9996808639861019      (cost = 0.48955331142833797)
f(0.6) = 2.8232123669751767                     0.9996808639861019      (cost = 0.1597447930366769)
f(0.8999999999999999) = 3.9166345481374165                      0.9996808639861019      (cost = 0.009936274645015763)
f(1.2) = 4.660195429836131                      0.9996808639861019      (cost = 0.04012775625335459)
f(1.5) = 4.987474933020272                      0.9996808639861019      (cost = 0.25031923786169347)
f(1.7999999999999998) = 4.869238154390977                       0.9996808639861019      (cost = 0.640510719470032)
f(2.1) = 4.316046833244369                      0.9996808639861019      (cost = 1.2107022010783715)
f(2.4) = 3.3773159027557553                     0.9996808639861019      (cost = 1.9608936826867098)
f(2.6999999999999997) = 2.136899401169151                       0.9996808639861019      (cost = 2.891085164295048)
f(3) = 0.7056000402993361                       0.9996808639861019      (cost = 4.0012766459033875)
f(3.3) = -0.7887284707162411                    0.9996808639861019      (cost = 5.291468127511726)
f(3.5999999999999996) = -2.2126022164742603                     0.9996808639861019      (cost = 6.761659609120064)
f(3.9) = -3.438830795919869                     0.9996808639861019      (cost = 8.411851090728403)
f(4.2) = -4.357878862067941                     0.9996808639861019      (cost = 10.242042572336745)
f(4.5) = -4.887650588325485                     0.9996808639861019      (cost = 12.252234053945083)
f(4.8) = -4.980823044179203                     0.9996808639861019      (cost = 14.44242553555342)
f(5.1) = -4.629073411638663                     0.9996808639861019      (cost = 16.812617017161756)
f(5.3999999999999995) = -3.8638224377799384                     0.9996808639861019      (cost = 19.362808498770093)
f(5.7) = -2.753427712988188                     0.9996808639861019      (cost = 22.09299998037844)
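
For reference, constructing that network with the NeuralNetwork class posted further down looks roughly like this (a sketch; the tanh activation choice here is just an example, but the topology and the 0.015 learning rate match the description above, and it assumes the using-declarations from main.cpp):

neural::NeuralNetwork<double> net(
    // 1 input neuron, 3 hidden layers of 10 neurons, 1 output neuron
    { 1, 10, 10, 10, 1 },
    // One activation function (and matching gradient) per non-input layer
    { tanh<double>, tanh<double>, tanh<double>, tanh<double> },
    { d_tanh<double>, d_tanh<double>, d_tanh<double>, d_tanh<double> },
    cost::SSR<double>, cost::d_SSR<double>,
    0.015 ); // fixed learning rate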

On the other hand, the network can learn an OR gate fine with a single neuron and around 1000 epochs:

Avg. cost change: 0.980538410283458 -> 0.00027972743666964956
OR:
0 || 0 -> 0 (0.017760015096961585)
0 || 1 -> 1 (0.9802997012073817)
1 || 0 -> 1 (0.9796199901202844)
1 || 1 -> 1 (0.9997877849709494)

It also seems to be heading towards the correct decision boundary for an AND gate, just more slowly:

Avg. cost change: 0.5385977455525776 -> 0.07853099914207844
AND:
0 && 0 -> 0 (-0.2521620532406275)
0 && 1 -> 0 (0.2521620532429756)
1 && 0 -> 0 (0.2521620532429756)
1 && 1 -> 1 (0.6487637642805655)

All of the linear algebra is performed using the Eigen3 library (v3.4). The code can be found on my GitHub here; fair warning that it's not exactly clean.

At the moment, the weight matrix is represented using an Eigen matrix, and the matrices for each layer are stored in a std::vector. Each matrix is shaped so that each column holds the weights for a given neuron, each row corresponds to a neuron in the previous layer, and there is an extra row at the bottom for the bias weights.

Example weight matrix for a layer with 2 neurons & 2 inputs:

w00  w01    <- weights for input 1
w10  w11    <- weights for input 2
b0   b1     <- bias weights
 ^    ^
neuron 1  neuron 2

Example input to the above weight matrix:

x0 x1 x2

The first two values are inputs coming from neurons in the previous layer; the third (x2) is always 1 to represent the constant bias.
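
As a minimal Eigen sketch of that layout (illustrative values only), the bias is folded into the matrix product by appending a 1 to the input row:

#include "Eigen/Dense"
#include <iostream>

int main() {
    // One column per neuron, one row per input, plus a final bias row.
    Eigen::MatrixXd W( 3, 2 );
    W << 0.1, 0.2,   // w00 w01
         0.3, 0.4,   // w10 w11
         0.5, 0.6;   // b0  b1

    // One sample: two inputs plus a trailing 1 for the bias.
    Eigen::RowVector3d x( 0.7, 0.8, 1.0 );

    // (1 x 3) * (3 x 2) = (1 x 2) pre-activations, biases included:
    // [ 0.1*0.7 + 0.3*0.8 + 0.5,  0.2*0.7 + 0.4*0.8 + 0.6 ]
    std::cout << x * W << "\n";
}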

I should also note that I had this issue before adding any kind of bias to the network at all.

A std::vector of Eigen matrices representing the output of each layer on the last forward pass is also stored.
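
Concretely (sketching the shapes for the 1-10-10-10-1 sin(x) network above, with a batch of N input rows), each stored layer output keeps the extra all-ones bias column on the right, including the output layer, which is why the forward pass strips it before returning:

// m_intermediate_state[0] : (N, 2)   input column + bias column of 1s
// m_intermediate_state[1] : (N, 11)  10 hidden outputs + bias column
// m_intermediate_state[2] : (N, 11)
// m_intermediate_state[3] : (N, 11)
// m_intermediate_state[4] : (N, 2)   1 network output + bias column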

The forward propagation is handled as follows:

// NeuralNetwork class is templated to take:
//  - std::floating_point
//  - std::complex (hoping to test out complex-valued networks at some point)

// The function takes an Eigen matrix of inputs, each row is a set of inputs
// Create new input matrix w/ extra column for bias inputs (always 1)
Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> input( inputs.rows(),
                                                        inputs.cols() + 1 );
// Check shapes, then construct new input
// from function input & bias vector
assert( input.cols() == m_network.front().rows() );
const auto bias{ layer_t<T>::Constant( input.rows(), 1, 1. ) };
input << inputs, bias;

// Set first layer of outputs matrix to input
// mostly irrelevant as input weights aren't adjusted
m_intermediate_state[0] = input;

// view into weight vector, output vector, & vector of layer activation functions
// Drops the 1st index which corresponds to the input layer
const auto view = std::views::zip( m_network, m_intermediate_state,
                                   m_activation_functions )
                  | std::views::drop( 1 );

for ( const auto & layers : view ) {
    auto & [network_layer, output_layer, act_func] = layers;

    // Set current output layer to correct shape filled w/ 1.
    output_layer = layer_t<T>::Constant(
        input.rows(), network_layer.cols() + 1, static_cast<T>( 1. ) );

    // Set all except rightmost column = result of applying the
    // activation function to result of matrix multiplication
    // between input & current weight matrix.
    output_layer.leftCols( output_layer.cols() - 1 )
        << act_func( input * network_layer );

    input = output_layer;
}

return input.leftCols(input.cols() - 1);

The backward propagation is handled as follows:

// The backward_pass function takes a set of labels as an input

// New 'labels' matrix w/ extra column for bias & check shape
layer_t<T> expected_output{ layer_t<T>::Constant(
    labels.rows(), labels.cols() + 1, static_cast<T>( 1. ) ) };
expected_output.leftCols( labels.cols() ) << labels;
assert( expected_output.rows() == m_intermediate_state.back().rows()
        && expected_output.cols() == m_intermediate_state.back().cols() );

// Sliding view into 2 layers at a time in reverse
auto layer_info_view{ std::views::zip( m_network, m_intermediate_state,
                                       m_activation_gradients )
                      | std::views::reverse | std::views::slide( 2 ) };

// Will use i to represent index of current layer
for ( const auto & layers : layer_info_view ) {
    // References to relevant layers
    // Note nxt_layer is the next layer in the backpropagation,
    // but the previous layer in the network (i - 1)
    const auto & cur_layer = layers[0];
    const auto & nxt_layer = layers[1];
    const auto & [nxt_weights, nxt_output, nxt_gradient] = nxt_layer;
    const auto & [cur_weights, cur_output, cur_gradient] = cur_layer;

    // Cost between previous layer (i + 1) & current
    const auto d_cost{ m_cost_gradient( cur_output, expected_output ) };
    // Act. func. grad. for outputs of current layer
    const auto d_cur_output{ cur_gradient( cur_output ) };
    // Coefficient-wise multiply cost grad. & act. func. grad.
    const auto gradients{ d_cost.cwiseProduct( d_cur_output ) };
    // Current weight matrix (i) gradients
    const auto d_weights{ nxt_output.transpose() * gradients };
    // Calculate new weights
    const auto new_weights{
        cur_weights - m_eta * d_weights.leftCols( cur_weights.cols() )
    };

    // Update current layer's weights
    cur_weights << new_weights;
    // Update input gradients for next layer
    expected_output = gradients.leftCols( cur_weights.cols() )
                      * cur_weights.transpose();
}
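
For reference, the update I'm trying to implement is the standard backpropagation recursion, written with samples as rows (so activations left-multiply the weights) and with the bias column dropped when propagating the error backwards:

\delta_L = \frac{\partial C}{\partial a_L} \odot f'(z_L)
\delta_l = \left( \delta_{l+1} W_{l+1}^{\top} \right) \odot f'(z_l)
\frac{\partial C}{\partial W_l} = a_{l-1}^{\top} \delta_l
W_l \leftarrow W_l - \eta \, \frac{\partial C}{\partial W_l}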

If you have any advice/suggestions, or need more info, just ask :).

Thanks for any help you can give.

--- EDIT ---

Adding the source code below to make reproducing the issue easier.

neural_util.hpp

#pragma once

#include "Eigen/Dense"

#include <complex>
#include <functional>
#include <utility>
#include <vector>

namespace neural
{

template <typename T>
struct is_complex : std::false_type
{};
template <std::floating_point T>
struct is_complex<std::complex<T>> : std::true_type
{};

template <typename T>
concept Weight = std::floating_point<T> || is_complex<T>::value;

template <Weight T>
using layer_t = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>;
template <Weight T>
using network_t = std::vector<layer_t<T>>;

template <Weight T>
using output_layer_t = layer_t<T>;
template <Weight T>
using output_network_t = std::vector<layer_t<T>>;

template <Weight T>
using bias_layer_t = layer_t<T>;
template <Weight T>
using bias_network_t = std::vector<layer_t<T>>;

template <Weight T>
using function_t = std::function<const layer_t<T>( const layer_t<T> & )>;
template <Weight T>
using cost_function_t =
    std::function<layer_t<T>( const layer_t<T> &, const layer_t<T> & )>;

} // namespace neural

neural_activation.hpp

#pragma once

#include "neural_util.hpp"

#include <algorithm>
#include <format>   // TODO: Remove
#include <iostream> // TODO: Remove
#include <numeric>
#include <ranges>

namespace neural
{

namespace activation
{

template <Weight T>
constexpr inline layer_t<T>
linear( const layer_t<T> & m ) {
    return m;
}

template <Weight T>
constexpr inline layer_t<T>
sigmoid( const layer_t<T> & m ) {
    return m.unaryExpr( []( const T x ) {
        return static_cast<T>( 1 ) / ( static_cast<T>( 1 ) + std::exp( -x ) );
    } );
}

template <Weight T>
constexpr inline layer_t<T>
d_sigmoid( const layer_t<T> & m ) {
    return m.unaryExpr(
        []( const T x ) { return x * ( static_cast<T>( 1 ) - x ); } );
}

template <Weight T>
constexpr inline layer_t<T>
relu( const layer_t<T> & m ) {
    return m.unaryExpr(
        []( const T x ) { return std::max( x, static_cast<T>( 0 ) ); } );
}

template <Weight T>
constexpr inline layer_t<T>
d_relu( const layer_t<T> & m ) {
    return m.unaryExpr( []( const T x ) {
        return static_cast<T>( ( x > static_cast<T>( 0 ) ) ? 1 : 0 );
    } );
}

template <Weight T>
constexpr inline layer_t<T>
tanh( const layer_t<T> & m ) {
    return m.unaryExpr( []( const T x ) {
        const auto e_x{ std::exp( x ) }, e_minus_x{ std::exp( -x ) };
        return ( e_x - e_minus_x ) / ( e_x + e_minus_x );
    } );
}

template <Weight T>
constexpr inline layer_t<T>
d_tanh( const layer_t<T> & m ) {
    return m.unaryExpr( []( const T x ) {
        const auto e_x{ std::exp( x ) }, e_minus_x{ std::exp( -x ) };
        const auto tanh{ ( e_x - e_minus_x ) / ( e_x + e_minus_x ) };
        return static_cast<T>( 1. ) - std::pow( tanh, 2 );
    } );
}

} // namespace activation

namespace cost
{

template <Weight T>
constexpr inline layer_t<T>
SSR( const layer_t<T> & predictions, const layer_t<T> & labels ) {
    return ( labels - predictions ).unaryExpr( []( const T x ) {
        return std::pow( x, 2 );
    } );
}

template <Weight T>
constexpr inline layer_t<T>
d_SSR( const layer_t<T> & predictions, const layer_t<T> & labels ) {
    return static_cast<T>( -2 ) * ( labels - predictions );
}

} // namespace  cost

} // namespace neural

neural_network.hpp

#pragma once // NEURAL_NETWORK_HPP

#include "Eigen/Core"
#include "neural_activation.hpp"
#include "neural_util.hpp"

#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <concepts>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <format>
#include <functional>
#include <iostream>
#include <numeric>
#include <ranges>
#include <vector>



namespace neural
{

template <Weight T>
class NeuralNetwork
{
    private:
    std::uint64_t              m_n_inputs;
    std::uint64_t              m_n_outputs;
    std::uint64_t              m_n_layers;
    std::vector<std::uint64_t> m_neurons_per_layer;
    std::vector<function_t<T>> m_activation_functions;
    std::vector<function_t<T>> m_activation_gradients;
    cost_function_t<T>         m_cost_function;
    cost_function_t<T>         m_cost_gradient;
    T                          m_eta;

    network_t<T>        m_network;
    output_network_t<T> m_intermediate_state;

    public:
    NeuralNetwork( const std::vector<std::uint64_t> & neurons_per_layer,
                   const std::vector<function_t<T>> & activation_functions,
                   const std::vector<function_t<T>> & activation_gradients,
                   const cost_function_t<T> & cost_function = cost::SSR<T>,
                   const cost_function_t<T> & cost_gradient = cost::d_SSR<T>,
                   const T                    eta = static_cast<T>( 0.01 ),
                   const std::time_t          seed = 1 ) :
        m_n_inputs( neurons_per_layer.front() ),
        m_n_outputs( neurons_per_layer.back() ),
        m_n_layers( neurons_per_layer.size() ),
        m_neurons_per_layer( neurons_per_layer ),
        m_cost_function( cost_function ),
        m_cost_gradient( cost_gradient ),
        m_eta( eta ) {
        // Must be an activation function for each layer (except the input
        // layer)
        m_activation_functions = std::vector<function_t<T>>{};
        m_activation_functions.push_back( activation::linear<T> );
        m_activation_functions.insert( m_activation_functions.end(),
                                       activation_functions.cbegin(),
                                       activation_functions.cend() );
        m_activation_gradients = std::vector<function_t<T>>{};
        m_activation_gradients.push_back( activation::linear<T> );
        m_activation_gradients.insert( m_activation_gradients.end(),
                                       activation_gradients.cbegin(),
                                       activation_gradients.cend() );
        assert( m_activation_functions.size() == m_n_layers );

        // Seed random number generator
        std::srand( static_cast<unsigned>( seed ) );

        // Initializing network, biases & intermediate_state in the same
        // shape for back propagation using the outputs of each layer
        m_network = network_t<T>( m_n_layers );
        m_intermediate_state = output_network_t<T>( m_n_layers );

        // Initializing layers
        std::uint64_t prev_layer_sz{ m_n_inputs };

        for ( std::uint64_t i{ 0 }; i < m_n_layers; ++i ) {
            const auto n_neurons{ m_neurons_per_layer[i] };
            m_network[i] = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>(
                prev_layer_sz + 1, n_neurons );
            m_network[i].setRandom();
            prev_layer_sz = n_neurons;
        }
    }

    [[nodiscard]] constexpr inline auto n_inputs() const noexcept {
        return m_n_inputs;
    }
    [[nodiscard]] constexpr inline auto n_outputs() const noexcept {
        return m_n_outputs;
    }
    [[nodiscard]] constexpr inline auto n_layers() const noexcept {
        return m_n_layers;
    }
    [[nodiscard]] constexpr inline auto shape() const noexcept {
        return m_neurons_per_layer;
    }
    [[nodiscard]] constexpr inline auto activation_funcs() const noexcept {
        return m_activation_functions;
    }
    [[nodiscard]] constexpr inline auto cost_func() const noexcept {
        return m_cost_function;
    }
    [[nodiscard]] constexpr inline auto cost_gradient() const noexcept {
        return m_cost_gradient;
    }
    [[nodiscard]] constexpr inline auto & network() const noexcept {
        return m_network;
    }
    [[nodiscard]] constexpr inline auto & intermediate_state() const noexcept {
        return m_intermediate_state;
    }
    [[nodiscard]] constexpr inline auto &
    layer( const std::uint64_t i ) const noexcept {
        return m_network.at( i );
    }

    [[nodiscard]] constexpr inline Eigen::Matrix<T, Eigen::Dynamic,
                                                 Eigen::Dynamic>
    forward_pass(
        const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> & inputs,
        const std::size_t V = 0 ) noexcept;
    [[nodiscard]] constexpr inline auto
    backward_pass( const layer_t<T> & labels,
                   const std::size_t  V = 0 ) noexcept;
};

template <Weight T>
[[nodiscard]] constexpr inline Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>
NeuralNetwork<T>::forward_pass(
    const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> & inputs,
    const std::size_t                                        V ) noexcept {
    Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> input( inputs.rows(),
                                                            inputs.cols() + 1 );
    assert( input.cols() == m_network.front().rows() );
    const auto bias{ layer_t<T>::Constant( input.rows(), 1, 1. ) };
    input << inputs, bias;
    m_intermediate_state[0] = input;
    const auto view = std::views::zip( m_network, m_intermediate_state,
                                       m_activation_functions )
                      | std::views::drop( 1 );
    if ( V )
        std::cout << "Forward pass:\n";
    for ( const auto & [i, layers] : view | std::views::enumerate ) {
        auto & [network_layer, output_layer, act_func] = layers;

        output_layer = layer_t<T>::Constant(
            input.rows(), network_layer.cols() + 1, static_cast<T>( 1. ) );

        output_layer.leftCols( output_layer.cols() - 1 )
            << act_func( input * network_layer );

        // TODO: Remove when working.
        if ( V ) {
            std::cout << i << "\n";
            std::cout << std::format( "input ({}, {})\n", input.rows(),
                                      input.cols() )
                      << input << "\n";
            if ( V > 1 ) {
                std::cout << std::format( "network_layer ({}, {})\n",
                                          network_layer.rows(),
                                          network_layer.cols() )
                          << network_layer << "\n";
            }
            std::cout << std::format( "output_layer ({}, {})\n",
                                      output_layer.rows(), output_layer.cols() )
                      << output_layer << "\n";
        }
        input = output_layer;
    }
    if ( V ) {
        std::cout << std::format( "output ({}, {})\n", input.rows(),
                                  input.cols() )
                  << input << "\n";
        std::cout << "Done.\n" << std::endl;
    }

    return input.leftCols( input.cols() - 1 );
}

template <Weight T>
[[nodiscard]] constexpr inline auto
NeuralNetwork<T>::backward_pass( const layer_t<T> & labels,
                                 const std::size_t  V ) noexcept {
    layer_t<T> expected_output{ layer_t<T>::Constant(
        labels.rows(), labels.cols() + 1, static_cast<T>( 1. ) ) };
    expected_output.leftCols( labels.cols() ) << labels;
    assert( expected_output.rows() == m_intermediate_state.back().rows()
            && expected_output.cols() == m_intermediate_state.back().cols() );
    auto layer_info_view{ std::views::zip( m_network, m_intermediate_state,
                                           m_activation_gradients )
                          | std::views::reverse | std::views::slide( 2 ) };
    if ( V )
        std::cout << "Backward pass:\n";

    for ( const auto & [i, layers] : layer_info_view | std::views::enumerate ) {
        // References to relevant layers
        const auto & cur_layer = layers[0];
        const auto & nxt_layer = layers[1];
        const auto & [nxt_weights, nxt_output, nxt_gradient] = nxt_layer;
        const auto & [cur_weights, cur_output, cur_gradient] = cur_layer;

        const auto d_cost{ m_cost_gradient( cur_output, expected_output ) };
        const auto d_cur_output{ cur_gradient( cur_output ) };
        const auto gradients{ d_cost.cwiseProduct( d_cur_output ) };
        const auto d_weights{ nxt_output.transpose() * gradients };
        // const auto new_weights{
        //     cur_weights - m_eta * d_weights.leftCols( d_weights.cols() - 1 )
        // };
        const auto new_weights{
            cur_weights - m_eta * d_weights.leftCols( cur_weights.cols() )
        };

        // TODO: Remove when working.
        if ( V ) {
            if ( V > 1 ) {
                std::cout << std::format( "cur_weights ({}, {})\n",
                                          cur_weights.rows(),
                                          cur_weights.cols() )
                          << cur_weights << "\n";
                std::cout << std::format( "new_weights ({}, {})\n",
                                          new_weights.rows(),
                                          new_weights.cols() )
                          << new_weights << "\n";
            }
            if ( V > 2 ) {
                std::cout << std::format( "d_cost ({}, {})\n", d_cost.rows(),
                                          d_cost.cols() )
                          << d_cost << std::endl;
            }
            if ( V > 3 ) {
                std::cout << std::format( "cur_output ({}, {})\n",
                                          cur_output.rows(), cur_output.cols() )
                          << cur_output << std::endl;
                std::cout << std::format( "expected_output ({}, {})\n",
                                          expected_output.rows(),
                                          expected_output.cols() )
                          << expected_output << std::endl;
            }
            std::cout << "Weight diff:\n" << new_weights - cur_weights << "\n";
        }

        // Update current layer's weights
        cur_weights << new_weights;
        // Update input gradients for next layer
        // expected_output = gradients.leftCols( gradients.cols() - 1 )
        expected_output =
            gradients.leftCols( cur_weights.cols() ) * cur_weights.transpose();
    }
    if ( V )
        std::cout << "Done.\n" << std::endl;

    // std::cout << std::endl;
}
} // namespace neural

main.cpp

// #include "Eigen/Core"
#include "Eigen/Dense"
#include "neural_network.hpp"

#include <iostream>
#include <ranges>

using namespace neural;
using namespace neural::activation;

template <Weight T>
constexpr inline T
f( const T x ) {
    return 5 * std::sin( x );
}

template <Weight T>
constexpr inline layer_t<T>
gen_f_data( const T xmin, const T xmax, const std::size_t N ) {
    auto    result = layer_t<T>( N, 1 );
    const T dx{ ( xmax - xmin ) / static_cast<T>( N ) };
    for ( std::size_t i{ 0 }; i < N; ++i ) {
        const T xval{ xmin + dx * static_cast<T>( i ) };
        result.row( static_cast<Eigen::Index>( i ) ) << xval;
    }
    return result;
}

template <Weight T>
constexpr inline T
OR( const T x1, const T x2 ) {
    return ( x1 || x2 ) ? 1. : 0.;
}

template <Weight T>
constexpr inline T
AND( const T x1, const T x2 ) {
    return ( x1 && x2 ) ? 1. : 0.;
}

template <Weight T>
constexpr inline layer_t<T>
gen_f_labels( const layer_t<T> & inputs ) {
    auto result = layer_t<T>( inputs.rows(), 1 );
    for ( Eigen::Index i{ 0 }; i < inputs.rows(); ++i ) {
        result( i, 0 ) = f( inputs( i, 0 ) );
    }
    return result;
}

int
main() {
    std::cout << "Creating NeuralNetwork:" << std::endl;
    neural::NeuralNetwork<double> test(
        { 2, 1, 1 }, { tanh<double>, tanh<double> },
        { d_tanh<double>, d_tanh<double> }, cost::SSR<double>,
        cost::d_SSR<double>, 0.015 );
    std::cout << "Done." << std::endl;

    const auto f_input = gen_f_data<double>( 0., 1., 10 );
    const auto f_labels = gen_f_labels<double>( f_input );

    auto OR_samples{ layer_t<double>( 4, 2 ) };
    auto AND_samples{ layer_t<double>( 4, 2 ) };
    OR_samples << 0, 0, 0, 1, 1, 0, 1, 1;
    AND_samples << 0, 0, 0, 1, 1, 0, 1, 1;

    auto OR_labels{ layer_t<double>( 4, 1 ) };
    auto AND_labels{ layer_t<double>( 4, 1 ) };
    OR_labels << 0, 1, 1, 1;
    AND_labels << 0, 0, 0, 1;

    const auto inputs{ OR_samples };
    const auto labels{ OR_labels };

    const std::size_t epochs{ 10 };

    const auto initial_cost{ cost::SSR<double>( labels,
                                                test.forward_pass( inputs ) ) };
    const auto initial_avg_cost{
        initial_cost.sum()
        / static_cast<double>( initial_cost.rows() * initial_cost.cols() )
    };

    std::cout << "Training..." << std::endl;
    for ( std::size_t i{ 0 }; i < epochs; ++i ) {
        std::cout << "epoch " << i + 1 << ":\n";

        const auto epoch_initial_cost{ cost::SSR<double>(
            labels, test.forward_pass( inputs ) ) };

        const layer_t<double> output = test.forward_pass( inputs, 2 );
        test.backward_pass( labels, 4 );

        const auto final_cost{ cost::SSR<double>(
            labels, test.forward_pass( inputs ) ) };

        const auto epoch_initial_avg_cost{ epoch_initial_cost.sum()
                                           / static_cast<double>(
                                               epoch_initial_cost.rows()
                                               * epoch_initial_cost.cols() ) };
        const auto final_avg_cost{
            final_cost.sum()
            / static_cast<double>( final_cost.rows() * final_cost.cols() )
        };

        std::cout << std::format( "Epoch avg. cost change: {} -> {}",
                                  epoch_initial_avg_cost, final_avg_cost )
                  << std::endl;
        std::cout << std::format( "Avg. cost change: {} -> {}",
                                  initial_avg_cost, final_avg_cost )
                  << std::endl;
    }
    std::cout << "Done." << std::endl;

    std::cout << "OR:" << std::endl;
    for ( Eigen::Index i{ 0 }; i < OR_samples.rows(); ++i ) {
        std::cout << std::format(
            "{} || {} -> {}\t({})\n", OR_samples( i, 0 ), OR_samples( i, 1 ),
            OR_labels( i, 0 ),
            test.forward_pass( OR_samples.row( i ) )( 0, 0 ) );
    }
}

1 Answer

Answered by Ben Andrews:

I've now solved the part of the problem the question was about, and realised I hadn't updated the question. If you look at the NeuralNetwork::backward_pass function, you can see that the variable d_cost is recalculated at each layer boundary, which is incorrect. As a result, any time more than a single layer of back propagation was required, the weight updates would go horribly wrong.

Instead, d_cost should be calculated once, before the back propagation begins, from the final output layer and the labels; the gradients are then passed back from layer to layer as they are. With the d_cost recalculation where it was, each layer computed a fresh cost gradient between its own output and the gradient passed back from the previous (later) layer.

The backward_pass function now looks like:

template <Weight T>
constexpr inline void
NeuralNetwork<T>::backward_pass( const layer_t<T> & labels ) noexcept {
    // Alter labels to include 1 for bias layer, not sure this is correct
    layer_t<T> expected_output{ layer_t<T>::Constant(
        labels.rows(), labels.cols() + 1, static_cast<T>( 1. ) ) };
    expected_output.leftCols( labels.cols() ) << labels;

    assert( expected_output.rows() == m_intermediate_state.back().rows()
            && expected_output.cols() == m_intermediate_state.back().cols() );

    // View into network layers, forward pass layer outputs, activation function
    // gradients
    auto layer_info_view{ std::views::zip( m_network, m_intermediate_state,
                                           m_activation_gradients )
                          | std::views::reverse | std::views::slide( 2 ) };

    // Calculate initial cost gradient
    auto d_cost{ m_cost_gradient( m_intermediate_state.back(),
                                  expected_output ) };
    for ( const auto & [i, layers] : layer_info_view | std::views::enumerate ) {
        // References to relevant layers
        const auto & cur_layer = layers[0];
        const auto & nxt_layer = layers[1];
        const auto & [nxt_weights, nxt_output, nxt_gradient] = nxt_layer;
        const auto & [cur_weights, cur_output, cur_gradient] = cur_layer;

        const auto d_cur_output{ cur_gradient( cur_output ) };
        const auto gradients{ d_cost.cwiseProduct( d_cur_output ) };
        const auto d_weights{ nxt_output.transpose() * gradients };
        const auto new_weights{
            cur_weights - m_eta * d_weights.leftCols( cur_weights.cols() )
        };

        // Update current layer's weights
        cur_weights << new_weights;
        // Update input gradients for next layer
        d_cost =
            gradients.leftCols( cur_weights.cols() ) * cur_weights.transpose();
    }
}

As a result, it can now correctly learn the slightly more complex XOR gate, which requires multiple layers. Similarly, if I arbitrarily increase the number of layers, the decision boundary is still learnt.
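
For completeness, the XOR check is set up the same way as the OR/AND data in main.cpp; a sketch (the exact topology isn't important, as long as there is at least one hidden layer):

auto XOR_samples{ layer_t<double>( 4, 2 ) };
auto XOR_labels{ layer_t<double>( 4, 1 ) };
XOR_samples << 0, 0, 0, 1, 1, 0, 1, 1;
XOR_labels << 0, 1, 1, 0;

// e.g. a 2-2-1 network with tanh activations
neural::NeuralNetwork<double> xor_net(
    { 2, 2, 1 }, { tanh<double>, tanh<double> },
    { d_tanh<double>, d_tanh<double> }, cost::SSR<double>,
    cost::d_SSR<double>, 0.015 );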

Unfortunately, there are definitely still issues. It still seems unable to learn more complex functions such as sin(x) or even x^2, and the root cause of that remains unknown. It may be another logic error, or it may resolve itself as I implement other improvements such as learning rate schedulers, different cost functions, etc.