I'm trying to build an algorithm that maximizes the terminal wealth of a portfolio, using the REINFORCE with baseline algorithm presented in Sutton and Barto (2018). I have one neural network for the policy, which takes current wealth and the time left on the investment horizon as inputs and outputs two values: the mean and standard deviation of a normal distribution. The discounted dollar amount invested in the risky asset is then sampled from this distribution. I have a second network for the value function (same inputs, but it outputs the state value).

I have solved the problem analytically, and my value network converges well to the optimal solution. My policy network does not, which leads me to believe that I could improve its architecture to 'help' it find the optimal solution. I am reasonably new to PyTorch and neural networks, so I would appreciate ideas on how I could do this. My policy network is below; it has two hidden layers with 32 nodes each. I have also played around with the learning rates, but that does not seem to help much. Thanks!
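import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class PolicyNetwork(nn.Module):
    ''' Neural network for the policy, which is taken to be normally distributed, hence
    this network returns a mean and a standard deviation '''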
    def __init__(self, lr, input_dims, fc1_dims, fc2_dims, n_returns):
        super(PolicyNetwork, self).__init__()
        self.input_dims = input_dims
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_returns = n_returns
        self.lr = lr
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims) # inputs should be wealth and time to maturity
        self.fc2 = nn.Linear(self.fc1_dims,self.fc2_dims)
        self.fc3 = nn.Linear(self.fc2_dims,n_returns) # returns mean and sd of normal dist
        self.optimizer = optim.Adam(self.parameters(), lr = lr)
        
    def forward(self, observation):
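        # convert the (wealth, time left) observation to a float tensor and add a batch dimension
        state = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)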
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        first_slice = x[:,0]
        second_slice = x[:,1]
        tuple_of_activated_parts = (
                first_slice, # let mean be negative
                #F.relu(first_slice), # make sure mean is positive
                #torch.sigmoid(second_slice) # make sure sd is positive
                F.softplus(second_slice) # make sd positive without capping it below 1 (as sigmoid would)
                )
        out = torch.cat(tuple_of_activated_parts, dim=-1)
        return out
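
For reference, here is a framework-free sketch of the per-step quantities that enter the REINFORCE with baseline update when the policy is a normal distribution with the mean and softplus-activated standard deviation returned above. The function names (`softplus`, `normal_log_prob`, `reinforce_step_terms`) and the dictionary keys are hypothetical and not part of my actual code, and the discount factor from the Sutton and Barto pseudocode is assumed to be 1 here.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)); same activation applied to the sd output.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def normal_log_prob(action, mean, sd):
    # Log-density of N(mean, sd^2) evaluated at the sampled action.
    return (-0.5 * ((action - mean) / sd) ** 2
            - math.log(sd)
            - 0.5 * math.log(2.0 * math.pi))

def reinforce_step_terms(action, mean, raw_sd, G, v):
    """Per-step terms for REINFORCE with baseline (hypothetical helper).

    action : sampled dollar amount in the risky asset
    mean   : first network output
    raw_sd : second network output before the softplus activation
    G      : return observed from this step onward
    v      : value-network estimate for the current state (the baseline)
    """
    sd = softplus(raw_sd)
    delta = G - v                      # return minus baseline
    log_prob = normal_log_prob(action, mean, sd)
    policy_loss = -delta * log_prob    # minimizing this ascends delta * log pi
    # Score function: derivatives of log pi with respect to mean and sd.
    dlogp_dmean = (action - mean) / sd ** 2
    dlogp_dsd = ((action - mean) ** 2 - sd ** 2) / sd ** 3
    return {
        "sd": sd,
        "delta": delta,
        "log_prob": log_prob,
        "policy_loss": policy_loss,
        "dlogp_dmean": dlogp_dmean,
        "dlogp_dsd": dlogp_dsd,
    }

# Example with made-up numbers, purely for illustration.
terms = reinforce_step_terms(action=1.0, mean=0.0, raw_sd=0.0, G=2.0, v=0.5)
print(terms)
```

Both score terms are divided by powers of `sd`, so the size of the policy update depends on the current standard deviation as well as on `delta`.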