I'm using the Brown Corpus. I want some way to print out all the possible tags and their names (not just the tag abbreviations). There are also quite a few tags. Is there a way to 'simplify' them? By simplify I mean combine two extremely similar tags into one and re-tag the affected words with the merged tag.
NLTK - Get and Simplify List of Tags
5.5k views Asked by Nate Cook3 At
There are 2 answers
Many of the tagsets in NLTK's corpora come with predefined mappings to a simplified, "universal" tagset. Besides being more convenient for many purposes, the simplified tagset gives a degree of compatibility between the different corpora that support remapping to it.
For the Brown corpus, you can simply fetch tagged words or sentences like this:
brown.tagged_words(tagset="universal")
For example:
>>> print(brown.tagged_words(tagset="universal")[:10])
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'),
('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'),
('of', 'ADP')]
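To list each simplified tag next to a readable name and a count, something like the following sketch may help. The `UNIVERSAL_TAG_NAMES` dict and the `tag_counts` helper are hand-written assumptions for illustration, not part of NLTK, and `sample` just copies the output above so the snippet runs without the corpus; in practice you would pass `brown.tagged_words(tagset="universal")` instead.

```python
from collections import Counter

# Hand-written names for the 12 universal tags (an illustrative assumption;
# NLTK does not return the names in this form).
UNIVERSAL_TAG_NAMES = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "CONJ": "conjunction",
    "DET": "determiner",
    "NOUN": "noun",
    "NUM": "numeral",
    "PRT": "particle",
    "PRON": "pronoun",
    "VERB": "verb",
    ".": "punctuation",
    "X": "other",
}

# Stand-in for brown.tagged_words(tagset="universal")[:10]
sample = [
    ("The", "DET"), ("Fulton", "NOUN"), ("County", "NOUN"), ("Grand", "ADJ"),
    ("Jury", "NOUN"), ("said", "VERB"), ("Friday", "NOUN"), ("an", "DET"),
    ("investigation", "NOUN"), ("of", "ADP"),
]


def tag_counts(tagged_words):
    """Count how many times each tag occurs in (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_words)


counts = tag_counts(sample)
for tag, n in counts.most_common():
    name = UNIVERSAL_TAG_NAMES.get(tag, "unknown")
    print(f"{tag:5} {name:12} {n}")
```

The set of keys in `counts` is also the list of tags that actually occur in whatever slice of the corpus you pass in.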
To see the definitions of the original, complex tags in the Brown corpus, use
nltk.help.brown_tagset()
(also mentioned in the answer linked by Alvas). You can get the whole list by calling it without arguments, or pass an argument (a regexp) to get only the matching tag(s). The results include a brief definition and examples.
>>> nltk.help.brown_tagset("DT.*")
DT: determiner/pronoun, singular
this each another that 'nother
DT$: determiner/pronoun, singular, genitive
another's
DT+BEZ: determiner/pronoun + verb 'to be', present tense, 3rd person singular
that's
DT+MD: determiner/pronoun + modal auxillary
that'll this'll
...
This has been discussed before in:
Java Stanford NLP: Part of Speech labels?
Simplifying the French POS Tag Set with NLTK
https://linguistics.stackexchange.com/questions/2249/turn-penn-treebank-into-simpler-pos-tags
The POS tags output by
nltk.pos_tag
come from the Penn Treebank tagset, https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html, see What are all possible pos tags of NLTK?
There are several approaches, but the simplest might be to use only the first two characters of each POS tag as the main set of tags. This works because the first two characters of a tag represent the broad POS class in the Penn Treebank tagset.
For instance, NNS means plural noun and NNP means proper noun, and the NN prefix subsumes both by representing the generic noun.
Here's a code example:
The shortened version looks like this:
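A minimal sketch of the two-character truncation, assuming a hand-written list of Penn Treebank tagged words in place of real `nltk.pos_tag` output. The `shorten` and `group_by_short_tag` helpers are hypothetical names used only for this illustration.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
tagged = [
    ("dogs", "NNS"), ("London", "NNP"), ("dog", "NN"),
    ("ran", "VBD"), ("running", "VBG"),
    ("quickly", "RB"), ("faster", "RBR"),
]


def shorten(tag):
    """Keep only the first two characters of a POS tag."""
    return tag[:2]


def group_by_short_tag(tagged_words):
    """Group words under their shortened tag, preserving input order."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[shorten(tag)].append(word)
    return dict(groups)


groups = group_by_short_tag(tagged)
print(groups)
```

Truncation is deliberately coarse: it also folds together pairs such as PRP and PRP$, so check the resulting classes against what your application needs.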
Another solution is to use the universal POS tags, see http://www.nltk.org/book/ch05.html
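If the universal tagset is too coarse and you only want to merge a few similar tags, a hand-written merge table is a possible middle ground. The sketch below assumes a hypothetical `MERGE` dict over Brown tags and a hypothetical `merge_tags` helper; neither is part of NLTK. For mapping a single tag to the universal tagset, NLTK itself provides `nltk.tag.map_tag`.

```python
# Hypothetical merge table over Brown tags: each key is folded into its value.
MERGE = {
    "NNS": "NN",  # plural noun -> noun
    "NN$": "NN",  # possessive noun -> noun
    "DTI": "DT",  # singular-or-plural determiner -> determiner
}


def merge_tags(tagged_words, table=MERGE):
    """Return a new list with merged tags; tags not in the table pass through."""
    return [(word, table.get(tag, tag)) for word, tag in tagged_words]


# Hand-written stand-in for a slice of brown.tagged_words()
sample = [("jury", "NN"), ("reports", "NNS"), ("some", "DTI"), ("said", "VBD")]
merged = merge_tags(sample)
print(merged)
```

Because unknown tags pass through unchanged, the table can start small and grow as you decide which distinctions you do not need.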