How to create a categorized tagged corpus reader


I have a bunch of files, with their categories listed in cats.txt in the same folder. I want to create a CategorizedTaggedCorpusReader for them.

This is how my files look: [image]

I tried many approaches in NLTK but failed to create a CategorizedTaggedCorpusReader. Inside cats.txt, each line holds a file name followed by its category names, separated by spaces; a file can have multiple categories.

For instance:

mail_1_adapter adapter 
mail_1_alert alert 
messagebody_24862499 others
etc.
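For reference, the line format described above can be read with a few lines of plain Python. This is only a sketch for sanity-checking a categories file, not NLTK's own parsing code; the helper name `parse_cats` and the choice to skip blank lines are assumptions made for illustration.

```python
from collections import defaultdict


def parse_cats(text, delimiter=" "):
    """Map each file name to the list of categories listed after it.

    Hypothetical helper: it only illustrates the
    "filename category1 category2 ..." layout described above.
    """
    mapping = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()  # drops trailing spaces and newlines
        if not line:
            continue  # this sketch simply skips blank lines
        file_id, *categories = line.split(delimiter)
        # Ignore empty strings produced by repeated delimiters.
        mapping[file_id].extend(c for c in categories if c)
    return dict(mapping)


sample = """mail_1_adapter adapter 
mail_1_alert alert 
messagebody_24862499 others"""

cats = parse_cats(sample)
for file_id, categories in cats.items():
    print(file_id, "->", categories)
```

A file with several categories would simply list them all on its line, e.g. `mail_1_adapter adapter alert`.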

Can you please show me a better way to create my corpus and make use of it?

1 answer

alexis

Your file format is fine. How exactly did you try to create your reader, and how did it fail? You don't show your code, so there's no telling what went wrong. You need to tell the reader to read the categories from the file cats.txt, for example like this:

 from nltk.corpus.reader import CategorizedTaggedCorpusReader

 # Replace "<path>" with the directory that holds the corpus files and cats.txt.
 reader = CategorizedTaggedCorpusReader("<path>", r"^[^.]*$", cat_file="cats.txt")

Your categories file cats.txt is not part of the corpus, so I used the regexp ^[^.]*$ which matches everything not containing a dot. If this doesn't correctly describe your files, change the definition as needed to include all corpus files, but exclude cats.txt.
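Before building the reader, it can help to check which file names the pattern actually selects. The sketch below uses only Python's `re` module on a hand-written list of names taken from the question; the helper `matching_fileids` is hypothetical and is not part of NLTK.

```python
import re

# The file-selection pattern from the answer: names without any dot.
FILEID_PATTERN = r"^[^.]*$"


def matching_fileids(names, pattern=FILEID_PATTERN):
    """Return the names that the pattern would select as corpus files."""
    return [name for name in names if re.match(pattern, name)]


# File names from the question, plus the categories file itself.
names = ["mail_1_adapter", "mail_1_alert", "messagebody_24862499", "cats.txt"]

selected = matching_fileids(names)
print(selected)  # cats.txt is left out because its name contains a dot
```

If the corpus files carried an extension (for example a hypothetical `mail_1_adapter.txt`), this pattern would skip them as well, so the regexp would need to be adjusted to match those names while still excluding cats.txt.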