Text representation for neural network training


I'm developing a neural network with nntool in MATLAB. My inputs are 11,250 text files of varying length (from 10 to 500 words, or roughly 10 to 200 words if I remove redundant words). I haven't found a good way to represent these texts as numerical data for my training algorithm. I thought about building a vocabulary of words, but it turned out to contain 16,000 distinct words, which is huge. Some words are shared between text files.


1 Answer

Answered by 404pio:

For a quick solution, look into "bag of words" or "tf-idf". If you don't know what these are, start here: https://en.wikipedia.org/wiki/Vector_space_model or https://en.wikipedia.org/wiki/Document_classification .
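As a rough illustration of the tf-idf idea (not the asker's MATLAB/nntool workflow), here is a minimal sketch in Python using scikit-learn's TfidfVectorizer; the `corpus/*.txt` paths and the `max_features` cap are assumptions made for the example:

```python
# Minimal sketch: turn a folder of text files into a tf-idf matrix.
# Assumes the documents live as *.txt files under "corpus/";
# max_features caps the 16,000-word vocabulary at a smaller size.
import glob

from sklearn.feature_extraction.text import TfidfVectorizer

# Read every document into a list of strings.
paths = sorted(glob.glob("corpus/*.txt"))
docs = [open(p, encoding="utf-8").read() for p in paths]

# Build the vocabulary and the document-term matrix in one step.
# stop_words="english" drops common filler words; max_features keeps
# only the most frequent terms, shrinking the input dimension.
vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

print(X.shape)  # e.g. (11250, 2000)
# Each row of X is a fixed-length numeric vector for one document,
# which can then be fed to a neural network as a training input.
```

Capping or pruning the vocabulary this way is what keeps the fixed-length input vectors manageable even when the raw vocabulary has 16,000 words.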

Have you read any book about NLP? This one may be valuable, at least as a starting point: http://www.nltk.org/book/ .