How to reduce the number of POS tags in Penn Treebank? - NLTK (Python)

I used NLTK for part-of-speech tagging. It uses the 36 Penn Treebank tags, and I want to reduce them to 6: noun, verb, adjective, adverb, preposition, conjunction. How should I do this? Is there a specific function, attribute, or command?

There are 4 answers

Chiarcos

You cannot reduce to these 6 tags, because there will be an "other" category for things like determiners or pronouns that cannot be directly reduced to any of the categories you mention.

That said, the short answer is:

  • Check out this link for a mapping table in HTML
  • It performs a live lookup for your specific reduction in the Ontologies of Linguistic Annotation (see the long answer for an explanation)
  • To use that mapping directly in NLTK, request JSON as the output format and parse the result into a Python dict, as via this link and as sketched below
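
A minimal sketch of that last step, assuming the public http://sparql.org/sparql endpoint and the requests library (fetch_mapping is a hypothetical helper of mine, not part of NLTK; the query text is the one from the long answer below):

    import requests

    SPARQL_ENDPOINT = "http://sparql.org/sparql"

    def fetch_mapping(sparql_query):
        # "output=json" requests the SPARQL JSON results format;
        # passing the query via params also takes care of URI escaping
        response = requests.get(SPARQL_ENDPOINT,
                                params={"query": sparql_query,
                                        "output": "json"})
        response.raise_for_status()
        mapping = {}
        for binding in response.json()["results"]["bindings"]:
            # ambiguous tags can occur more than once; keep the first hit
            mapping.setdefault(binding["tag"]["value"],
                               binding["category"]["value"])
        return mapping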

The long answer:

  • To reduce the tags to your "target tags", you can use the Ontologies of Linguistic Annotation [disclosure: I'm maintaining these] with the following SPARQL query:

      PREFIX system: <http://purl.org/olia/system.owl#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      PREFIX olia: <http://purl.org/olia/olia.owl#>
    
      # columns of the mapping table 
      SELECT distinct ?tag ?category
    
      # lookup in the Ontologies of Linguistic Annotation
      FROM <http://purl.org/olia/penn.owl>        # Penn tags 
      FROM <http://purl.org/olia/olia.owl>        # reference concepts (Noun etc.)
      FROM <http://purl.org/olia/penn-link.rdf>   # Penn -> reference concepts
    
      # the actual query
      WHERE { 
    
          # for an element with a particular tag
          ?a system:hasTag ?tag.
    
          # retrieve all its super classes
          OPTIONAL { 
          ?a a/(rdfs:subClassOf|owl:equivalentClass|
                owl:unionOf|owl:intersectionOf)* ?b.
    
              # but only if they match your target categories
              # see http://purl.org/olia/olia.owl for their definitions
              FILTER(?b in (
                  olia:Noun, olia:Verb, olia:Adjective,
                  olia:Adverb, olia:Preposition, 
                  olia:Conjunction
                  )) 
          }
    
          # return the local name of the target category
          # if none of your target categories can be found, return "OTHER" 
          BIND(if(bound(?b), replace(str(?b),".*[#/]",""), "OTHER") AS ?category)
      }
      ORDER BY ?tag
    
  • See inline comments for explanation. You can adjust the filter conditions to get more, fewer or other categories. Note that this query can return multiple mappings if Penn tags are ambiguous (disjunction, i.e. owl:unionOf).

  • No need to set up your own endpoint for such occasional queries; just go to http://sparql.org/sparql.html and copy, paste, and edit that query. Different output formats are possible; select "Output XML" and the default XSL stylesheet to get an HTML view.

  • The entire query can be condensed into a single URI (as above). You can customize your query and output formats, click on "Get Results" and copy the URL of the resulting page. (Or build it yourself, using standard URI escaping.)

  • Note that whenever you click on that link, you run a live query. Better do that once and store your mapping table (see the caching sketch at the end of this answer).

  • Note that the complex expression (rdfs:subClassOf|owl:equivalentClass|owl:unionOf|owl:intersectionOf)* allows you to search over OWL axioms. However, this is search, not reasoning, so you will only retrieve classes that are explicitly defined as superclasses.

  • Note that owl:unionOf is a logical or. There is no way to disambiguate that by means of a SPARQL query; if you want to treat tags with ambiguous definitions as OTHER, remove that expression from the property path.

  • Also note that this is not restricted to Penn: OLiA supports tagsets for more than 100 languages; see http://purl.org/olia
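
Once fetched (e.g. with the hypothetical fetch_mapping helper sketched in the short answer above), you can cache the table locally so the live query runs only once:

    import json

    mapping = fetch_mapping(sparql_query)  # sparql_query = the query text above
    with open("penn_to_categories.json", "w") as f:
        json.dump(mapping, f, indent=2)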

alexis

Ask for the "universal" tagset:

treebank.tagged_sents(tagset="universal")

It's not quite the list you specify (e.g., it keeps a separate tag for determiners), but it comes close. If you still don't like it, you can rename the rest of the POS tags yourself, as in the sketch below.
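
For instance, a minimal sketch of that renaming (assuming the treebank and universal_tagset data packages are installed; the KEEP set and the "OTHER" label are my own choices, not NLTK's):

from nltk.corpus import treebank

# the six categories from the question, in universal-tagset terms:
# ADP covers prepositions, CONJ covers conjunctions
KEEP = {"NOUN", "VERB", "ADJ", "ADV", "ADP", "CONJ"}

sent = treebank.tagged_sents(tagset="universal")[0]
reduced = [(word, tag if tag in KEEP else "OTHER") for word, tag in sent]
print(reduced)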

alvas

The UPenn tagset documentation can be accessed as such:

>>> import nltk
>>> nltk.help.upenn_tagset()

"What are all possible pos tags of NLTK?" has a good, detailed discussion of it.


Note that while the Wall Street Journal (WSJ) subset of the Penn Treebank (PTB) uses the UPenn tagset, the Brown corpus (another subset of the PTB) has a finer-grained tagset:

>>> nltk.help.brown_tagset()

Although the original PTB ships with both the UPenn and Brown tags, the tags in the treebank corpus can be mapped to a smaller set. As @alexis has shown, the Universal Tagset view of the PTB corpus can be accessed like this:

treebank.tagged_sents(tagset="universal")

They are mapped to the Universal Tagset by nltk.tag.mapping.tagset_mapping, using the mapping resources from the nltk_data/taggers/universal_tagset/en-*.map files:

~/nltk_data/taggers/universal_tagset$ ls
README             de-negra.map       en-tweet.map       fi-tdt.map         ja-verbmobil.map   sl-sdt.map
ar-padt.map        de-tiger.map       es-cast3lb.map     fr-paris.map       ko-sejong.map      sv-talbanken.map
bg-btb.map         el-gdt.map         es-eagles.map      hu-szeged.map      nl-alpino.map      tu-metusbanci.map
ca-cat3lb.map      en-brown.map       es-iula.map        it-isst.map        pl-ipipan.map      universal_tags.py
cs-pdt.map         en-ptb.map         es-treetagger.map  iw-mila.map        pt-bosque.map      zh-ctb6.map
da-ddt.map         en-tweet.README    eu-eus3lb.map      ja-kyoto.map       ru-rnc.map         zh-sinica.map
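
For example, a quick way to inspect that mapping (assuming the universal_tagset resource has been downloaded):

from nltk.tag.mapping import tagset_mapping

ptb_to_universal = tagset_mapping('en-ptb', 'universal')
print(ptb_to_universal['NNS'], ptb_to_universal['VBD'], ptb_to_universal['IN'])
# NOUN VERB ADP
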
Guiem Bosch

I recommend using the tagset_mapping function. If you map from en-ptb (the Penn Treebank PoS tagset) to universal, you will reduce the number of PoS tags.

Here is a very simple example of how to incorporate it:

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.tag.mapping import tagset_mapping

PTB_UNIVERSAL_MAP = tagset_mapping('en-ptb', 'universal')

def to_universal(tagged_words):
    return [(word, PTB_UNIVERSAL_MAP[tag]) for word, tag in tagged_words]

text = "This is a very simple example."
pos_tagged = pos_tag(word_tokenize(text))

You can observe the difference before and after the mapping:

print(pos_tagged)
>>> [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('very', 'RB'), ('simple', 'JJ'), ('example', 'NN'), ('.', '.')]

print(to_universal(pos_tagged))
>>> [('This', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('very', 'ADV'), ('simple', 'ADJ'), ('example', 'NOUN'), ('.', '.')]

I would advise you to stick to this mapping, even though it yields more tags than you asked for; this way you'll be following an established convention. Besides, the "extra" tags are mostly punctuation.

In case you strictly want to map to your fixed set "noun, verb, adjective, adverb, preposition, conjunction", you can always use the map_tag method, as in the sketch below.
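
A minimal sketch of that stricter reduction (UNIVERSAL_TO_SIX and to_six are hypothetical names of mine; collapsing everything else to "OTHER" is one possible convention, as Chiarcos notes above):

from nltk import pos_tag
from nltk.tag.mapping import map_tag
from nltk.tokenize import word_tokenize

# second stage: collapse the universal tags onto the six categories
# from the question; everything else (determiners, pronouns,
# punctuation, ...) becomes "OTHER"
UNIVERSAL_TO_SIX = {
    "NOUN": "noun", "VERB": "verb", "ADJ": "adjective",
    "ADV": "adverb", "ADP": "preposition", "CONJ": "conjunction",
}

def to_six(tagged_words):
    return [(word,
             UNIVERSAL_TO_SIX.get(map_tag('en-ptb', 'universal', tag), "OTHER"))
            for word, tag in tagged_words]

print(to_six(pos_tag(word_tokenize("This is a very simple example."))))
# [('This', 'OTHER'), ('is', 'verb'), ('a', 'OTHER'), ('very', 'adverb'),
#  ('simple', 'adjective'), ('example', 'noun'), ('.', 'OTHER')]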

Notice you might have to download extra resources:

import nltk
nltk.download('universal_tagset')