NLTK "OMW" WordNet with the Arabic language


I'm working in Python with NLTK's Open Multilingual Wordnet (OMW), specifically for the Arabic language. All the functions work fine for English, yet I can't get any of them to work when I use the 'arb' language tag. The only thing that works well is extracting the lemma_names from a given Arabic synset.

The code below works fine with u'arb'; the output is a list of Arabic lemmas.

from nltk.corpus import wordnet as wn

for synset in wn.synsets(u'عام', lang='arb'):
    for lemma in synset.lemma_names(u'arb'):
        print(lemma)

When I apply the same logic to synsets, definitions, examples, or hypernyms, I get an error that says:

TypeError: hyponyms() takes exactly 1 argument (2 given)

(if I supply the 'arb' flag) or

KeyError: u'arb'

Here is one example. It fails if I write synset.hyponyms(u'arb'); without the argument it runs, but the hyponyms are printed in English:

for synset in wn.synsets(u'عام', lang='arb'):
    for hypo in synset.hyponyms():  # prints the hyponyms in English, not Arabic
        print(hypo)

Does this mean that I can't use wn.all_synsets and the other built-in functions to extract all the Arabic synsets, hypernyms, etc.?

1 answer

alexis On

NLTK's Open Multilingual Wordnet has English names for all its synsets, because it is a multilingual database built around the original English WordNet. Synsets model meanings, so they are language-independent and cannot be requested in a specific language; that is why hyponyms() accepts no language argument. Each synset is, however, linked to lemmas in the languages covered by the OMW. Once you have some synsets (the originals, their hyponyms, etc.), just ask for the Arabic lemmas again:

>>> from nltk.corpus import wordnet as wn
>>> for synset in wn.synsets(u'عام', lang='arb'):
...     for hypo in synset.hyponyms():
...         for lemma in hypo.lemmas("arb"):
...             print(lemma)
... 
Lemma('waft.v.01.إِنْبعث')
Lemma('waft.v.01.انبعث')
Lemma('waft.v.01.إنبعث_كالرائحة_العطرة')
Lemma('waft.v.01.إِنْدفع')
Lemma('waft.v.01.إِنْطلق')
Lemma('waft.v.01.انطلق')
Lemma('waft.v.01.حمل_بخفة')
Lemma('waft.v.01.دفع')
Lemma('calendar_year.n.01.سنة_شمْسِيّة')
Lemma('calendar_year.n.01.سنة_مدنِيّة')
Lemma('fiscal_year.n.01.سنة_ضرِيبِيّة')
Lemma('fiscal_year.n.01.سنة_مالِيّة')

In other words, the lemmas are multilingual; the synsets are not.
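The same pattern should carry over to hypernyms and to wn.all_synsets(): fetch the synsets without a language argument, then ask each one for its lemma names in Arabic. Below is a minimal Python 3 sketch of that idea. The helper names lemma_names_in, related_lemma_names and demo are made up for illustration and are not part of NLTK; the sketch also assumes that lemma_names() returns an empty list for a synset with no lemmas in the requested language, and that the WordNet and OMW data have already been downloaded.

```python
def lemma_names_in(synsets, lang="arb"):
    """Map each synset's name to its lemma names in `lang`.

    Hypothetical helper (not an NLTK function). Synsets that have no
    lemmas in `lang` are skipped.
    """
    names_by_synset = {}
    for synset in synsets:
        names = synset.lemma_names(lang)
        if names:
            names_by_synset[synset.name()] = list(names)
    return names_by_synset


def related_lemma_names(synset, relation="hyponyms", lang="arb"):
    """Follow a relation such as "hyponyms" or "hypernyms", then look up
    the lemma names of the resulting synsets in `lang`.

    Hypothetical helper (not an NLTK function).
    """
    # The relation itself is language-independent, so it takes no argument.
    related = getattr(synset, relation)()
    return lemma_names_in(related, lang)


def demo():
    """Illustrative only: needs NLTK plus its WordNet and OMW corpora."""
    from nltk.corpus import wordnet as wn

    # Arabic lemma names of the hypernyms of each sense of the word.
    for synset in wn.synsets(u'عام', lang='arb'):
        print(synset.name(), related_lemma_names(synset, "hypernyms"))

    # Every synset that has at least one Arabic lemma.
    arabic = lemma_names_in(wn.all_synsets(), "arb")
    print(len(arabic))
```

Calling demo() would print, for each sense of the word, the Arabic lemma names of its hypernyms, followed by the number of synsets that have any Arabic lemma. The identifiers in the result are still the English synset names; only the lemma names come back in Arabic.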