I've been coding a neural network in Java for recognizing and tagging parts of speech in English. The code itself runs without errors and has no apparent flaws. Nevertheless, it is not learning: no matter how much I train it, its ability to predict the testing data does not change. Below is a description of what I've done; please ask me to update this post if I've left something important out.
I wrote the neural network and tested it on several different problems to make sure that the network itself worked. I trained it to double numbers, compute XOR, cube numbers, and approximate the sin function to decent accuracy. So, I'm fairly confident that the actual algorithm is working.
The network uses the sigmoid activation function. The learning rate is .3 and the momentum is .6. The weights are initialized to (rng.nextFloat() - .5) * 4, i.e., uniformly in [-2, 2).
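For concreteness, here's a minimal sketch of the activation and initialization described above (not my exact code; `rng` is just a `java.util.Random` instance):

```java
import java.util.Random;

public class Activation {
    static final Random rng = new Random();

    // Logistic sigmoid: squashes any input into (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Sigmoid derivative expressed via the neuron's output, as used in backprop.
    static double sigmoidDerivative(double output) {
        return output * (1.0 - output);
    }

    // Weight initialization: uniform in [-2, 2).
    static float initialWeight() {
        return (rng.nextFloat() - 0.5f) * 4f;
    }
}
```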
I then took the Brown Corpus and simplified the tagset to 'universal' with NLTK; I used NLTK to generate and save all the corpus and dictionary data. I cut the last 15,000 sentences out of the corpus for testing and used the rest (about 40,000 tagged sentences) for training.
The neural network layout is as follows. Input layer: the network takes inputs for 3 words at a time: the word before the word we want to tag, the word to be tagged itself, and the word that follows it. For each of those 3 words there is one input neuron per tag, so the total number of inputs is 3 × (total number of possible tags), and the input values are numbers between 0 and 1. Output layer: one output neuron for each tag. Each of the 3 input words is looked up in a dictionary built from the 40,000 training sentences (the same corpus used for training). The dictionary holds the number of times each word was tagged as each part of speech in the corpus.
For instance, the word 'cover' is tagged as a noun 1 time and a verb 2 times.
Percentages are computed for each part of speech the word has been tagged as, and these are what is fed into the network for that word. So for 'cover', the input neuron designated NOUN would receive .33 and VERB would receive .66; the other input neurons in that word's group receive 0.0. This is done for each of the 3 input words. If a word is the first word of a sentence, the group of inputs for the preceding word is all 0s; if a word is the last word of a sentence, the group of inputs for the following word is left as 0s. I've been using 10 hidden nodes (I've read a number of papers, and this seems to be a good place to start testing with).
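To make the encoding concrete, here's a rough sketch of how the input vector gets built; the `TAGS` array and the dictionary shape are illustrative placeholders, not my exact code:

```java
import java.util.Map;

public class InputEncoder {
    // Illustrative universal tagset; substitute the actual tag list.
    static final String[] TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
                                  "ADP", "NUM", "CONJ", "PRT", ".", "X"};

    // dictionary: word -> (tag -> count), built from the training corpus.
    static double[] encodeWord(String word, Map<String, Map<String, Integer>> dict) {
        double[] slot = new double[TAGS.length];          // all zeros by default
        Map<String, Integer> counts = dict.get(word);
        if (counts == null) return slot;                  // unknown word: handled separately
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        for (int i = 0; i < TAGS.length; i++) {
            Integer c = counts.get(TAGS[i]);
            if (c != null) slot[i] = c / total;           // e.g. 'cover': NOUN 1/3 ≈ .33, VERB 2/3 ≈ .67
        }
        return slot;
    }

    // Concatenate the 3 word slots: previous word, current word, next word.
    // A null prev/next (sentence boundary) leaves that slot as all zeros.
    static double[] encodeContext(String prev, String cur, String next,
                                  Map<String, Map<String, Integer>> dict) {
        double[] in = new double[3 * TAGS.length];
        if (prev != null) System.arraycopy(encodeWord(prev, dict), 0, in, 0, TAGS.length);
        System.arraycopy(encodeWord(cur, dict), 0, in, TAGS.length, TAGS.length);
        if (next != null) System.arraycopy(encodeWord(next, dict), 0, in, 2 * TAGS.length, TAGS.length);
        return in;
    }
}
```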
None of the 15,000 testing sentences were used to build the dictionary, so when testing the network on this held-out portion of the corpus there will be some words the network has never seen. Words that are not recognized have their suffix stripped, and that suffix is looked up in a second 'dictionary' of suffix statistics; whatever tag distribution is most probable for that suffix is then used as the input for the word.
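The suffix fallback looks roughly like this (a sketch only; the suffix lengths tried and the dictionary format here are assumptions, not my actual values):

```java
import java.util.Map;

public class UnknownWords {
    // Fallback for words never seen in training: try progressively shorter
    // suffixes against a suffix dictionary that maps each suffix to a
    // precomputed tag-probability vector, built the same way as the word
    // dictionary.
    static double[] encodeUnknown(String word, int numTags,
                                  Map<String, double[]> suffixDict) {
        for (int len = Math.min(4, word.length() - 1); len >= 1; len--) {
            String suffix = word.substring(word.length() - len);
            double[] probs = suffixDict.get(suffix);
            if (probs != null) return probs;   // most probable tags for this suffix
        }
        return new double[numTags];            // no suffix matched: all zeros
    }
}
```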
This is my setup, and with it I started training the network on all 40,000 sentences. 1 epoch = 1 forward and backpropagation pass over every word in each sentence of the 40,000-sentence training set, so a single epoch takes quite a few seconds. Just from knowing the word probabilities the network did pretty well, but no matter how much more I train it, nothing changes. The numbers following the epoch counts below are the number of correctly tagged words divided by the total number of words.
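In pseudocode-ish Java, one epoch looks like this (the `Net` interface is a stand-in for my network class, and 0.3 / 0.6 are the learning rate and momentum from above):

```java
import java.util.List;

public class TrainingLoop {
    // Stand-in for the actual network class.
    interface Net {
        double[] forward(double[] input);
        void backpropagate(double[] target, double learningRate, double momentum);
    }

    // One-hot target: 1.0 on the correct tag's output neuron, 0.0 elsewhere.
    static double[] oneHot(int tagIndex, int numTags) {
        double[] target = new double[numTags];
        target[tagIndex] = 1.0;
        return target;
    }

    // One epoch: a forward + backward pass over every word of every training
    // sentence. `inputs` holds the precomputed 3-word context vectors and
    // `tagIndices` the gold tag index for each word.
    static void epoch(Net net, List<double[]> inputs, List<Integer> tagIndices, int numTags) {
        for (int i = 0; i < inputs.size(); i++) {
            net.forward(inputs.get(i));
            net.backpropagate(oneHot(tagIndices.get(i), numTags), 0.3, 0.6);
        }
    }
}
```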
- 50 epochs (first run): 0.928218786
- 100 epochs: 0.933130661
- 500 epochs: 0.928614499 (took around 30 minutes to train)
- 10 epochs: 0.928953683
- 1 epoch: results varied between roughly .92 and .93
So, it doesn't appear to be working...
I then took 55 sentences from the corpus, keeping the same dictionary built from the 40,000 training sentences. For this test I trained the network the same way I trained my XOR network: I used only those 55 sentences for training and tested the trained weights on only those same 55 sentences. The network learned them quite easily: over 120 epochs (taking a couple of seconds), it went from 3768 words tagged incorrectly and 56 correctly in the first few epochs to 3772 correct and 52 incorrect by the 120th epoch.
This is where I'm at. I've been trying to debug this for over a day now and haven't figured anything out.