NLTK allows me to disambiguate text with nltk.wsd.lesk, e.g.
>>> from nltk.corpus import wordnet as wn
>>> from nltk.wsd import lesk
>>> sent = "I went to the bank to deposit money"
>>> ambiguous = "deposit"
>>> lesk(sent.split(), ambiguous, pos='v')
Synset('deposit.v.02')
PyWSD does the same, but it only works on English text.
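For example, with PyWSD (assuming the simple_lesk interface shown in its README; the exact signature may differ between versions):

>>> from pywsd.lesk import simple_lesk
>>> simple_lesk("I went to the bank to deposit money", "deposit", pos='v')  # English input only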
NLTK also supports Arabic WordNet data through the Open Multilingual Wordnet, e.g.
>>> wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')
[u'\u0623\u064e\u0648\u0652\u062f\u064e\u0639\u064e']
>>> print wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')[0]
أَوْدَعَ
Also, the synsets are indexed for Arabic:
>>> wn.synsets(u'أَوْدَعَ', lang='arb')
[Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]
But how can I disambiguate Arabic text and extract concepts from a query using NLTK? Is it possible to use the Lesk algorithm on Arabic text through NLTK?
It's a little tricky but maybe this will work:
Try something along the lines of the sketch below:
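Since the glosses and example sentences that Lesk relies on are only available in English, one workaround is to score each candidate synset by the overlap between the Arabic context tokens and the Arabic lemma names attached to that synset (and to its hypernyms/hyponyms) in the Open Multilingual Wordnet. This is only a sketch: the helpers signature_arb and simple_lesk_arb are my own names, and building the signature from related synsets' Arabic lemmas is an assumption, not something NLTK provides out of the box.

# Requires the 'wordnet' corpus plus the Open Multilingual Wordnet data
# (nltk.download('omw') or nltk.download('omw-1.4'), depending on the NLTK version).
from nltk.corpus import wordnet as wn

def signature_arb(synset):
    # Arabic "signature" of a synset: its own Arabic lemma names plus
    # those of its hypernyms and hyponyms.
    words = set(synset.lemma_names(lang='arb'))
    for related in synset.hypernyms() + synset.hyponyms():
        words.update(related.lemma_names(lang='arb'))
    return words

def simple_lesk_arb(context_tokens, ambiguous_word, pos=None):
    # Return the candidate synset whose Arabic signature overlaps most with
    # the (already tokenized) Arabic context, or None if the word is not in
    # the Open Multilingual Wordnet.
    candidates = wn.synsets(ambiguous_word, pos=pos, lang='arb')
    if not candidates:
        return None
    context = set(context_tokens)
    return max(candidates, key=lambda ss: len(context & signature_arb(ss)))

You tokenize the Arabic query yourself, call simple_lesk_arb(tokens, word, pos='v'), and then map whatever synset comes back to its English lemma_names() or definition() to get the concept.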
But as you can see, there are many limitations: the glosses and examples in NLTK's WordNet are English only (the Open Multilingual Wordnet only adds Arabic lemma names), so the classic Lesk overlap against the definition has almost nothing to work with; Arabic coverage in the Open Multilingual Wordnet is much sparser than English, so many words simply are not there; and the lemma names come back fully diacritized (e.g. أَوْدَعَ), while ordinary Arabic text usually is not, so tokens from a real query may not match anything without normalization.