I am attempting to write a function that returns a list of NLTK definitions (WordNet synsets) for the tokens of a text document, subject to a constraint on each word's part of speech.
I first convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets, and then apply word_tokenize(), pos_tag(), and synsets() in turn, as seen in the following code:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

#convert the tag to the one used by wordnet.synsets
def convert_tag(tag):
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

#tokenize, tag, and find synsets (give the first match between each 'token' and 'word net_tag')
def doc_to_synsets(doc):
    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = wn.synsets(token, wordnet_tag)
    return syns[0]

#test
doc = 'document is a test'
doc_to_synsets(doc)
which, if programmed correctly, should return something like
[Synset('document.n.01'), Synset('be.v.01'), Synset('test.n.01')]
However, Python throws an error message:
'list' object has no attribute 'lower'
I also noticed that in the error message, it says
lemma = lemma.lower()
Does that mean I also need to 'lemmatize' my tokens, as this previous thread suggests? Or should I apply .lower() to the text document before doing all of this?
I am rather new to WordNet and don't really know whether it's .synsets that is causing the problem or the nltk part that is at fault. I would really appreciate it if someone could enlighten me on this.
Thank you.
[Edit] error traceback
AttributeError Traceback (most recent call last)
<ipython-input-49-5bb011808dce> in <module>()
22 return syns
23
---> 24 doc_to_synsets('document is a test.')
25
26
<ipython-input-49-5bb011808dce> in doc_to_synsets(doc)
18 tag = nltk.pos_tag(token)
19 wordnet_tag = convert_tag(tag)
---> 20 syns = wn.synsets(token, wordnet_tag)
21
22 return syns
/opt/conda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py in synsets(self, lemma, pos, lang, check_exceptions)
1481 of that language will be returned.
1482 """
-> 1483 lemma = lemma.lower()
1484
1485 if lang == 'eng':
AttributeError: 'list' object has no attribute 'lower'
So after using the code kindly suggested by @dugup and @udiboy1209, I get the following output:
[[Synset('document.n.01'),
Synset('document.n.02'),
Synset('document.n.03'),
Synset('text_file.n.01'),
Synset('document.v.01'),
Synset('document.v.02')],
[Synset('be.v.01'),
Synset('be.v.02'),
Synset('be.v.03'),
Synset('exist.v.01'),
Synset('be.v.05'),
Synset('equal.v.01'),
Synset('constitute.v.01'),
Synset('be.v.08'),
Synset('embody.v.02'),
Synset('be.v.10'),
Synset('be.v.11'),
Synset('be.v.12'),
Synset('cost.v.01')],
[Synset('angstrom.n.01'),
Synset('vitamin_a.n.01'),
Synset('deoxyadenosine_monophosphate.n.01'),
Synset('adenine.n.01'),
Synset('ampere.n.02'),
Synset('a.n.06'),
Synset('a.n.07')],
[Synset('trial.n.02'),
Synset('test.n.02'),
Synset('examination.n.02'),
Synset('test.n.04'),
Synset('test.n.05'),
Synset('test.n.06'),
Synset('test.v.01'),
Synset('screen.v.01'),
Synset('quiz.v.01'),
Synset('test.v.04'),
Synset('test.v.05'),
Synset('test.v.06'),
Synset('test.v.07')],
[]]
The problem now comes down to extracting the first match (the first element) of each list in 'syns' and collecting them into a new list. For the trial document 'document is a test', it should return:
[Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]
which is a list of the first match for each token in the text document.
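One possible way to take the first match from each inner list is sketched below. The helper name first_matches is hypothetical, and plain strings stand in for the Synset objects so the snippet runs without NLTK. It assumes that tokens with no synsets at all (such as the empty list at the end of the output above) should simply be skipped.

```python
def first_matches(syn_lists):
    """Return the first element of every non-empty inner list."""
    # An empty inner list is falsy, so `if syns` drops tokens without matches.
    return [syns[0] for syns in syn_lists if syns]


# Plain strings are used here as stand-ins for Synset objects (an assumption
# made only so the example is runnable without the WordNet corpus).
syns = [
    ['document.n.01', 'document.n.02'],
    ['be.v.01', 'be.v.02'],
    ['angstrom.n.01', 'vitamin_a.n.01'],
    ['trial.n.02', 'test.n.02'],
    [],
]

print(first_matches(syns))
# prints ['document.n.01', 'be.v.01', 'angstrom.n.01', 'trial.n.02']
```

The same comprehension works unchanged on a list of lists of real Synset objects, since it only relies on indexing and truthiness.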
lower() is a function of the str type, which basically returns a lower-case version of the string.

It looks like nltk.word_tokenize() returns a list of words, and not a single word. But to synsets() you need to pass a single str, and not a list of str.

You may want to try running synsets in a loop, once per token.

EDIT: better to use a list comprehension to get a list of syns.
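The answer's original code is not preserved here, so the following is only a minimal sketch of what such a list comprehension could look like, not necessarily the code that was posted. It assumes the NLTK tokenizer, tagger, and WordNet data have already been downloaded. Note that nltk.pos_tag returns (word, tag) pairs, so the tag is converted per token rather than passing the whole list to convert_tag.

```python
def convert_tag(tag):
    """Map a Penn Treebank tag (e.g. 'NN', 'VBZ') to a WordNet POS letter, or None."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    return tag_dict.get(tag[0])


def doc_to_synsets(doc):
    """Return one list of synsets per token in doc."""
    # Imported inside the function so convert_tag can be used on its own
    # even when NLTK or its data files are not available.
    import nltk
    from nltk.corpus import wordnet as wn

    tokens = nltk.word_tokenize(doc)
    # pos_tag yields (word, tag) pairs; each word is a single str,
    # which is what wn.synsets expects as its first argument.
    tagged = nltk.pos_tag(tokens)
    return [wn.synsets(word, convert_tag(tag)) for word, tag in tagged]
```

Calling doc_to_synsets('document is a test') would then give a list of lists. When convert_tag returns None for a tag, wn.synsets applies no part-of-speech filter, which would explain why 'a' matches noun synsets such as angstrom.n.01 in the output shown in the question.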