I used NLTK for part-of-speech tagging. It uses the 36 Penn Treebank tags. I want to reduce the number of tags to 6: "noun, verb, adjective, adverb, preposition, conjunction". How should I do so? Is there a specific function, attribute, or command for this?
How to reduce the number of POS tags in Penn Treebank? - NLTK (Python)
1.8k views. Asked by user8049144. There are 4 answers.
The UPenn tagset documentation can be accessed like this:
>>> import nltk
>>> nltk.help.upenn_tagset()
The question "What are all possible pos tags of NLTK?" has a good, detailed discussion of it.
Note that while the Wall Street Journal (wsj) subset of the Penn Treebank (PTB) uses the UPenn tagset, the brown corpus (a subset of the PTB) has a finer-grained tagset:
>>> nltk.help.brown_tagset()
Although the original PTB has the upenn and brown tags, the tags in the treebank corpus can be mapped. As @alexis has shown, the Universal Tagset of the PTB corpus can be accessed like this:
>>> from nltk.corpus import treebank
>>> treebank.tagged_sents(tagset="universal")
They are mapped to the Universal Tagset by nltk.tag.mapping.tagset_mapping, using the mapping resources from the nltk_data/taggers/universal_tagset/en-*.map files:
~/nltk_data/taggers/universal_tagset$ ls
README de-negra.map en-tweet.map fi-tdt.map ja-verbmobil.map sl-sdt.map
ar-padt.map de-tiger.map es-cast3lb.map fr-paris.map ko-sejong.map sv-talbanken.map
bg-btb.map el-gdt.map es-eagles.map hu-szeged.map nl-alpino.map tu-metusbanci.map
ca-cat3lb.map en-brown.map es-iula.map it-isst.map pl-ipipan.map universal_tags.py
cs-pdt.map en-ptb.map es-treetagger.map iw-mila.map pt-bosque.map zh-ctb6.map
da-ddt.map en-tweet.README eu-eus3lb.map ja-kyoto.map ru-rnc.map zh-sinica.map
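For reference, each .map file is plain text with one fine-grained tag and its universal tag per line, separated by a tab. Below is a minimal sketch of reading that layout. It assumes the tab-separated format, uses a handful of en-ptb style lines typed inline rather than the real file, and the helper name load_map_lines and the fallback to "X" for unknown tags are my own choices for illustration. In practice tagset_mapping does this loading for you.

```python
from collections import defaultdict

# A few lines in the layout of en-ptb.map (fine tag, TAB, universal tag).
# Typed inline for illustration; the real file covers every PTB tag.
SAMPLE_MAP = "CC\tCONJ\nDT\tDET\nIN\tADP\nJJ\tADJ\nNN\tNOUN\nRB\tADV\nVBZ\tVERB\n.\t.\n"


def load_map_lines(text, unknown="X"):
    """Parse tab-separated 'fine<TAB>coarse' lines into a dict.

    Tags that are missing from the table fall back to `unknown`
    (an assumption made for this sketch).
    """
    mapping = defaultdict(lambda: unknown)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fine, coarse = line.split("\t")
        mapping[fine] = coarse
    return mapping


ptb_to_universal = load_map_lines(SAMPLE_MAP)
print(ptb_to_universal["VBZ"])   # a tag present in the sample
print(ptb_to_universal["XYZ"])   # a tag absent from the sample
```

Seeing the files as simple two-column tables also makes it clear why the reduction is many-to-one: several fine-grained tags share the same universal tag.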
I recommend using the tagset_mapping function. If you ask it to map from en-ptb (the Penn Treebank PoS tags) to universal, you will reduce the number of PoS tags.
Here is a very simple example of how to use it:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.tag.mapping import tagset_mapping
PTB_UNIVERSAL_MAP = tagset_mapping('en-ptb', 'universal')
def to_universal(tagged_words):
    return [(word, PTB_UNIVERSAL_MAP[tag]) for word, tag in tagged_words]
text = "This is a very simple example."
pos_tagged = pos_tag(word_tokenize(text))  # list of (word, PTB tag) tuples
You can observe the difference before and after the mapping:
>>> print(pos_tagged)
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('very', 'RB'), ('simple', 'JJ'), ('example', 'NN'), ('.', '.')]
>>> print(to_universal(pos_tagged))
[('This', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('very', 'ADV'), ('simple', 'ADJ'), ('example', 'NOUN'), ('.', '.')]
I would advise you to stick to this mapping, even though it yields more tags than you asked for. That way you follow something of a "convention". Besides, the "extra" tags are mostly about punctuation.
If you strictly want to map to your fixed set "noun, verb, adjective, adverb, preposition, conjunction", you can convert individual tags with the map_tag function and then merge the remaining universal tags yourself.
Note that you might have to download extra resources:
import nltk
nltk.download('universal_tagset')
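If you do want exactly the six categories from the question, one option is to collapse the universal tags a second time with a small dictionary. This is only a sketch: the SIX_WAY table and the to_six helper are names I made up, folding everything else (determiners, pronouns, numbers, particles, punctuation) into OTHER is an assumption, and the input is the universal-tagged output shown above.

```python
# Hypothetical second-stage table: universal tag -> one of the six target labels.
SIX_WAY = {
    "NOUN": "noun",
    "VERB": "verb",
    "ADJ": "adjective",
    "ADV": "adverb",
    "ADP": "preposition",   # adpositions include prepositions
    "CONJ": "conjunction",
}


def to_six(universal_tagged, other="OTHER"):
    """Collapse (word, universal_tag) pairs to the six labels, else `other`."""
    return [(word, SIX_WAY.get(tag, other)) for word, tag in universal_tagged]


# Universal-tagged sentence copied from the example output above.
universal_tagged = [('This', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('very', 'ADV'),
                    ('simple', 'ADJ'), ('example', 'NOUN'), ('.', '.')]
six_tagged = to_six(universal_tagged)
print(six_tagged)
```

Keep in mind that the Penn tag IN covers subordinating conjunctions as well as prepositions, so a word like "because" would end up labelled as a preposition under this table.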
You cannot reduce to these 6 tags, because there will be an "other" category for things like determiners or pronouns that cannot be directly reduced to any of the categories you mention.
That said, the short answer is:
The long answer:
To reduce the tags to your "target tags", you can use the Ontologies of Linguistic Annotation [disclosure: I'm maintaining these] with the following SPARQL query:
See inline comments for explanation. You can adjust the filter conditions to get more, fewer or other categories. Note that this query can return multiple mappings if Penn tags are ambiguous (disjunction, i.e. owl:unionOf).
No need to set up your own endpoint for such occasional queries; just go to http://sparql.org/sparql.html and copy and paste (and edit) that query. Different output formats are possible; select "Output XML" and the default XSL stylesheet to get an HTML view.
The entire query can be condensed into a single URI (as above). To build your own, customize your query and output format, click on "Get Results", and copy the URL of the resulting page. (Or build it yourself, using standard URI escaping.)
Note that whenever you click on that link, you run a live query. It is better to do that once and store your mapping table.
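Storing the table can be as simple as writing it to a JSON file the first time and reading it back afterwards. A sketch, where fetch_mapping is a stand-in for however you run the query, the file name is made up, and the two entries are placeholders rather than actual query results:

```python
import json
import os
import tempfile


def fetch_mapping():
    """Placeholder for the live query; returns a tiny made-up table."""
    return {"NN": "noun", "VB": "verb"}


def load_or_build(path, build=fetch_mapping):
    """Return the cached table at `path`, building and saving it if absent."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    table = build()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(table, fh, indent=2, sort_keys=True)
    return table


# A temporary directory keeps the sketch from touching your working directory.
cache_path = os.path.join(tempfile.mkdtemp(), "penn_to_target.json")
first = load_or_build(cache_path)    # runs fetch_mapping and writes the file
second = load_or_build(cache_path)   # reads the file, no query is run
print(first == second)
```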
Note that the complex expression
(rdfs:subClassOf|owl:equivalentClass|owl:unionOf|owl:intersectionOf)*
allows you to search over OWL axioms. However, this is search, not reasoning, so you will only retrieve classes that are explicitly defined as superclasses.
Note that owl:unionOf is a logical or. There is no way to disambiguate that by means of a SPARQL query; if you want to treat tags with ambiguous definitions as OTHER, remove that expression from the property path.
Also note that this is not restricted to Penn: OLiA supports tagsets for more than 100 languages, see http://purl.org/olia