'list' object has no attribute 'lower' issue in wordnet synsets

I am attempting to write a function that returns a list of NLTK definitions for the 'tokens' tokenized from a text document, subject to the constraint of each token's part of speech.

I first converted the tag given by nltk.pos_tag to the tag used by wordnet.synsets, and then applied .word_tokenize(), .pos_tag(), and .synsets() in turn, as seen in the following code:

import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

#convert the tag to the one used by wordnet.synsets

def convert_tag(tag):    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

#tokenize, tag, and find synsets (give the first match for each 'token' and 'wordnet_tag')

def doc_to_synsets(doc):

    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = wn.synsets(token, wordnet_tag)

    return syns[0]

#test
doc = 'document is a test'
doc_to_synsets(doc)

which, if programmed correctly, should return something like

[Synset('document.n.01'), Synset('be.v.01'), Synset('test.n.01')]

However, Python throws an error message:

'list' object has no attribute 'lower'

I also noticed that in the error message, it says

lemma = lemma.lower()

Does that mean I also need to 'lemmatize' my tokens, as this previous thread suggests? Or should I apply .lower() to the text document before doing all of this?

I am rather new to wordnet and don't really know whether it's .synsets that is causing the problem or the nltk part that is at fault. It would be really appreciated if someone could enlighten me on this.

Thank you.

[Edit] error traceback

AttributeError                            Traceback (most recent call last)
<ipython-input-49-5bb011808dce> in <module>()
     22     return syns
     23 
---> 24 doc_to_synsets('document is a test.')
     25 
     26 

<ipython-input-49-5bb011808dce> in doc_to_synsets(doc)
     18     tag = nltk.pos_tag(token)
     19     wordnet_tag = convert_tag(tag)
---> 20     syns = wn.synsets(token, wordnet_tag)
     21 
     22     return syns

/opt/conda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py in synsets(self, lemma, pos, lang, check_exceptions)
   1481         of that language will be returned.
   1482         """
-> 1483         lemma = lemma.lower()
   1484 
   1485         if lang == 'eng':

AttributeError: 'list' object has no attribute 'lower'

So after using the code kindly suggested by @dugup and @udiboy1209, I get the following output

[[Synset('document.n.01'),
  Synset('document.n.02'),
  Synset('document.n.03'),
  Synset('text_file.n.01'),
  Synset('document.v.01'),
  Synset('document.v.02')],
 [Synset('be.v.01'),
  Synset('be.v.02'),
  Synset('be.v.03'),
  Synset('exist.v.01'),
  Synset('be.v.05'),
  Synset('equal.v.01'),
  Synset('constitute.v.01'),
  Synset('be.v.08'),
  Synset('embody.v.02'),
  Synset('be.v.10'),
  Synset('be.v.11'),
  Synset('be.v.12'),
  Synset('cost.v.01')],
 [Synset('angstrom.n.01'),
  Synset('vitamin_a.n.01'),
  Synset('deoxyadenosine_monophosphate.n.01'),
  Synset('adenine.n.01'),
  Synset('ampere.n.02'),
  Synset('a.n.06'),
  Synset('a.n.07')],
 [Synset('trial.n.02'),
  Synset('test.n.02'),
  Synset('examination.n.02'),
  Synset('test.n.04'),
  Synset('test.n.05'),
  Synset('test.n.06'),
  Synset('test.v.01'),
  Synset('screen.v.01'),
  Synset('quiz.v.01'),
  Synset('test.v.04'),
  Synset('test.v.05'),
  Synset('test.v.06'),
  Synset('test.v.07')],
 []]

The problem now comes down to extracting the first match (or first element) of each sublist in the list 'syns' (skipping the empty sublist at the end) and making them into a new list. For the trial document 'document is a test', it should return:

[Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]

which is a list of the first match for each token in the text document.
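
Presumably taking the first element of each non-empty sublist would do it; something like the following list comprehension is what I have in mind (the if s is meant to skip the empty sublist at the end), though I am not sure whether this is the right approach:

first_matches = [s[0] for s in syns if s]
# hoping for: [Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]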

There are 2 answers

Answer from udiboy1209:

lower() is a method of the str type, which basically returns a lower-case version of the string.

It looks like nltk.word_tokenize() returns a list of words, and not a single word. But to synsets() you need to pass a single str, and not a list of str.

You may want to try running synsets in a loop like so:

for token in nltk.word_tokenize(doc):
    syn = wn.synsets(token)

EDIT: better to use a list comprehension to get a list of syns

syns = [wn.synsets(token) for token in nltk.word_tokenize(doc)]
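
If you also want the part-of-speech constraint from your question, the tags from nltk.pos_tag can feed the same comprehension (a rough sketch reusing your convert_tag; passing None as the tag makes synsets() look at every part of speech):

tagged = nltk.pos_tag(nltk.word_tokenize(doc))
syns = [wn.synsets(token, convert_tag(tag)) for token, tag in tagged]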

Answer from dugup:

The problem is that wn.synsets expects a single token as its first argument but word_tokenize returns a list containing all of the tokens in the document. So your token and tag variables are actually lists.

You need to loop through all of the token-tag pairs in your document and generate a synset for each individually using something like:

tokens = nltk.word_tokenize(doc)
tags = nltk.pos_tag(tokens)  # list of (token, tag) pairs
doc_synsets = []
for token, tag in tags:
    wordnet_tag = convert_tag(tag)
    syns = wn.synsets(token, wordnet_tag)
    # only add the first matching synset to results
    # (skip tokens with no synsets at all, e.g. punctuation)
    if syns:
        doc_synsets.append(syns[0])
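
Putting that back into your doc_to_synsets and running it on the test sentence should then give something like the list you expected (a sketch; the exact synsets depend on your installed WordNet data):

def doc_to_synsets(doc):
    tokens = nltk.word_tokenize(doc)
    doc_synsets = []
    for token, tag in nltk.pos_tag(tokens):
        syns = wn.synsets(token, convert_tag(tag))
        if syns:
            doc_synsets.append(syns[0])
    return doc_synsets

doc_to_synsets('document is a test')
# [Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]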