I took the ML course on Coursera and modified one of the homework assignments to build a "general purpose" neural network to use in my projects.
While testing the NN (5 inputs, 2 outputs) on the same dataset and tweaking the parameters, I found that arbitrarily increasing the number of hidden units in the single hidden layer significantly improves the F-score on the cross-validation/test set.
For example, with 1 hidden unit the F-score is ~0.79; with 2, 3, and 4 it is ~0.83; but if I jump straight to 100 I get a perfect 1.0. At that point the minimum F-score I ever see is 0.99.
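In case it helps, the experiment is essentially the following loop (a simplified Octave sketch, not the exact repo code; `trainNN` and `predict` are placeholders for my actual functions, and the F-score calculation assumes binary labels):

```
% Simplified sketch of the hidden-unit sweep (not the exact repo code).
% trainNN and predict are placeholders for my actual functions.
% Assumes pred and y_cv are vectors of binary labels (0/1).
for h = [1 2 3 4 100]
  Theta = trainNN(X_train, y_train, h);   % train with h hidden units
  pred  = predict(Theta, X_cv);           % predictions on the CV set
  tp = sum(pred == 1 & y_cv == 1);        % true positives
  fp = sum(pred == 1 & y_cv == 0);        % false positives
  fn = sum(pred == 0 & y_cv == 1);        % false negatives
  prec = tp / (tp + fp);
  rec  = tp / (tp + fn);
  F1   = 2 * prec * rec / (prec + rec);   % the F-score quoted above
  fprintf('hidden units = %3d, F1 = %.2f\n', h, F1);
end
```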
I'm sure there are no bugs in the code, because the predictions are consistent with the F-score obtained (plus, when I submitted it as homework, there were no errors of any kind).
This is driving me crazy, because as far as I know a "good practice" is to keep the number of hidden units between the number of inputs and the number of outputs (in my case, between 5 and 2).
Do you have any idea/reference as to why this happens? Is it simply a case of "the more neurons you throw at it, the better"?
Thank you.
Link to the source code and sample data: https://github.com/mardurhack/NN_question_stackoverflow
I find the code partitioned across too many files, which makes it hard to see the full picture, but I guess that structure has some significance in your course. I would like clarification on something and I can't yet leave comments: what is your learning rule? I will adapt my answer if necessary based on your clarification, but for now I am assuming Levenberg-Marquardt as the baseline approach.
In "train.m" you partition your data only into training and testing sets. That would therefore mean that you are lacking any validation data, and I very recently wrote a layman's description of this does in an answer here. The more hidden neurons you add, the more you allow your network to contort its output to match what you are trying to map the inputs to, and the more it is susceptible to overfitting. So I would expect the prediction accuracy to increase on your network with more neurons, but that in no way leads to the conclusion that you have created a "better" model.
I think this next possibility is rather unlikely, considering that it takes a relatively large number of nodes to get a decent answer, but it could be that the system response does actually follow a very simple set of rules (hinted at by your well-trained model also doing well against the test data, even in the absence of validation data). In that case, as an engineer, I would be looking for a mechanistic model of this problem as the end goal, not a phenomenological one. I am not able to build my own model at the moment to look into this further.