I am getting familiar with NLTK and text categorization through Jacob Perkins's book "Python Text Processing with NLTK 2.0 Cookbook".
My corpus documents each consist of a paragraph of text, so each document is on a separate line of a file, not in a separate file. There are about 2 million such paragraphs/lines, and therefore about 2 million machine learning instances.
Each line in my file is a paragraph of text (a combination of domain title, description, and keywords) that is subject to feature extraction (tokenization, etc.) to turn it into an instance for a machine learning algorithm.
I have two such files: one with all the positives and one with all the negatives.
How can I load them into a CategorizedCorpusReader? Is it possible?
I tried other solutions before, like scikit-learn, and finally picked NLTK, hoping for an easier starting point that would get me to a result.
Assuming that you have two files:
file_pos.txt, file_neg.txt
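One way is to point a CategorizedPlaintextCorpusReader at the directory holding them and map each file to a category with cat_map. The root directory '.' and the sample lines below are assumptions to make the sketch run on its own; substitute your real 2M-line files:

```python
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Tiny stand-in files so this snippet is self-contained;
# in your case these already exist with ~2M lines each.
with open('file_pos.txt', 'w') as f:
    f.write('good site with useful keywords\n')
with open('file_neg.txt', 'w') as f:
    f.write('bad site with spammy keywords\n')

reader = CategorizedPlaintextCorpusReader(
    '.',                      # corpus root directory (assumed)
    r'file_.*\.txt',          # regex selecting the corpus files
    cat_map={'file_pos.txt': ['pos'],   # fileid -> list of categories
             'file_neg.txt': ['neg']},
)

print(reader.categories())                  # sorted category labels
print(reader.fileids(categories=['pos']))   # files carrying the 'pos' label
```

Alternatively, cat_pattern lets you derive the category from the file name with a regex (e.g. r'file_(\w+)\.txt') instead of listing a cat_map.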
After this, you can apply the usual corpus functions to it, such as words() and sents(), as well as tagged_sents(), tagged_words(), etc. if you use one of the tagged corpus readers.
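A self-contained example of those calls (it writes tiny stand-in files and rebuilds the reader so it runs as-is; the file names are the assumed ones from above):

```python
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Stand-in data so the snippet runs on its own.
with open('file_pos.txt', 'w') as f:
    f.write('good site with useful keywords\n')
with open('file_neg.txt', 'w') as f:
    f.write('bad site with spammy keywords\n')

reader = CategorizedPlaintextCorpusReader(
    '.', r'file_.*\.txt',
    cat_map={'file_pos.txt': ['pos'], 'file_neg.txt': ['neg']})

# Tokenised words of just the 'pos' class:
pos_words = list(reader.words(categories=['pos']))

# Since each *line* is one document, split the raw text back into documents:
pos_docs = reader.raw(fileids=['file_pos.txt']).splitlines()

# sents() and paras() also work, but they rely on the 'punkt'
# tokenizer data, which must be downloaded via nltk.download('punkt').
```

Each element of pos_docs is then one training instance, ready for your own tokenization and feature extraction.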
You might enjoy this tutorial about creating custom corpora: https://www.packtpub.com/books/content/python-text-processing-nltk-20-creating-custom-corpora