For all other NLTK corpora, calling corpus.raw() yields the original text from the files.
For example:
>>> from nltk.corpus import webtext
>>> webtext.raw()[:10]
'Cookie Man'
However, when calling brown.raw() you get tagged text.
>>> from nltk.corpus import brown
>>> brown.raw()[:10]
'\n\n\tThe/at '
I've read all the documentation I can find but can't seem to find an obvious explanation or way to get the un-tagged version. Is there a reason this corpus is tagged and the others aren't?
TL;DR
In Long
It's because the "raw" version of the Brown corpus is already tokenized and tagged, i.e. the corpus comes tagged, and that's its original form =)
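Since the raw text is just whitespace-separated word/tag pairs, the tags can also be stripped by hand. A minimal sketch, assuming every token has the form word/tag as in the output shown in the question; the helper name strip_brown_tags and the sample string are made up for illustration and are not part of NLTK:

```python
def strip_brown_tags(raw_text):
    """Return the words of Brown-style 'word/tag' text with the tags removed."""
    words = []
    for token in raw_text.split():
        # Split on the LAST slash so a word that itself contains '/' stays intact.
        word, sep, _tag = token.rpartition("/")
        # Tokens without any slash are kept unchanged.
        words.append(word if sep else token)
    return words


# Hand-typed sample in the same format as brown.raw(); not read from the corpus.
sample = "\n\n\tThe/at jury/nn said/vbd ./."
print(strip_brown_tags(sample))
```

In practice brown.words() already does this for you, so the helper is mainly useful for seeing what the raw format looks like.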
You can look at the individual files in your nltk_data directory:
If you want the words from the corpus, you can use brown.words(), e.g.
If you want to get words from a specific file:
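The word lookups can be sketched roughly as follows, assuming NLTK is installed and the corpus has been fetched with nltk.download('brown'). The helper first_words is a hypothetical name used only for illustration; words() and fileids() are the corpus reader's own methods, and 'ca01' is the first Brown file id:

```python
from itertools import islice


def first_words(reader, n=10, fileid=None):
    """Return the first n untagged words, optionally from a single file.

    `reader` is any NLTK-style corpus reader exposing words(fileids=...).
    """
    return list(islice(reader.words(fileids=fileid), n))


# Intended use (requires `nltk` and nltk.download('brown')):
#   from nltk.corpus import brown
#   brown.fileids()[:3]                  # file names to choose from
#   first_words(brown)                   # words from the whole corpus
#   first_words(brown, fileid='ca01')    # words from one file only
```

Taking a slice with islice avoids loading the whole corpus just to peek at the first few words.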
And the sentences from a specific file:
To print out the individual sentences:
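A sketch of the sentence-level access, under the same assumptions (NLTK installed, Brown corpus downloaded). brown.sents() returns each sentence as a list of tokens; the helper plain_sentences is a hypothetical name that just joins those tokens with spaces:

```python
def plain_sentences(reader, fileid=None):
    """Yield each sentence as one space-joined string of its tokens."""
    for tokens in reader.sents(fileids=fileid):
        yield " ".join(tokens)


# Intended use (requires `nltk` and nltk.download('brown')):
#   from nltk.corpus import brown
#   brown.sents(fileids='ca01')          # sentences as lists of tokens
#   for sentence in plain_sentences(brown, fileid='ca01'):
#       print(sentence)
```

The joined strings are still tokenized text, so punctuation stays separated from the words by spaces.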
Trying to detokenize the tokenized corpus is rather messy and may or may not work, but you can try the MosesDetokenizer:
First download the data needed by the MosesDetokenizer:
Then initialize the MosesDetokenizer:
And use MosesDetokenizer.detokenize():
To convert every sentence in brown into natural reading text:
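The detokenization loop can be sketched like this. Note that recent NLTK releases no longer ship MosesDetokenizer; it lives in the separate sacremoses package, and NLTK itself offers TreebankWordDetokenizer as an alternative. To keep the sketch runnable without either, it takes the detokenizer as a parameter; detokenize_corpus and naive_detokenize are hypothetical names, and naive_detokenize is only a crude stand-in, not the Moses algorithm:

```python
import re


def naive_detokenize(tokens):
    """Crude stand-in: join tokens, then pull common punctuation back onto the word."""
    return re.sub(r"\s+([.,;:!?])", r"\1", " ".join(tokens))


def detokenize_corpus(sents, detokenize=naive_detokenize):
    """Apply a detokenizer callable to every tokenized sentence."""
    return [detokenize(list(tokens)) for tokens in sents]


print(detokenize_corpus([["Hello", ",", "world", "."]]))

# Intended use with a real detokenizer (both need extra installs or data):
#   from nltk.corpus import brown
#   from nltk.tokenize.treebank import TreebankWordDetokenizer
#   detokenize_corpus(brown.sents(fileids='ca01'), TreebankWordDetokenizer().detokenize)
#
#   from sacremoses import MosesDetokenizer
#   detokenize_corpus(brown.sents(fileids='ca01'), MosesDetokenizer(lang='en').detokenize)
```

Whichever detokenizer is used, expect leftovers: Brown marks quotes with `` and '' tokens, and no detokenizer can fully recover the original spacing.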