NLTK allows me to disambiguate text with nltk.wsd.lesk, e.g.
>>> from nltk.corpus import wordnet as wn
>>> from nltk.wsd import lesk
>>> sent = "I went to the bank to deposit money"
>>> ambiguous = "deposit"
>>> lesk(sent.split(), ambiguous, pos='v')
Synset('deposit.v.02')
PyWSD does the same, but it only works on English text.
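For example, with PyWSD (assuming the simple_lesk interface shown in its README; the exact signature may differ between versions):

>>> from pywsd.lesk import simple_lesk
>>> simple_lesk("I went to the bank to deposit money", "deposit", pos='v')  # English input only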
NLTK also supports Arabic WordNet data through the Open Multilingual Wordnet, e.g.
>>> wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')
[u'\u0623\u064e\u0648\u0652\u062f\u064e\u0639\u064e']
>>> print wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')[0]
أَوْدَعَ
Also, the synsets are indexed for Arabic:
>>> wn.synsets(u'أَوْدَعَ', lang='arb')
[Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]
But how can I disambiguate Arabic text and extract concepts from a query using NLTK? Is it possible to use the Lesk algorithm on Arabic text through NLTK?
It's a little tricky but maybe this will work:
Try something along the lines of the sketch below:
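Since the glosses and example sentences that Lesk relies on are only available in English, one workaround is to score each candidate synset by the overlap between the Arabic context tokens and the Arabic lemma names attached to that synset (and to its hypernyms/hyponyms) in the Open Multilingual Wordnet. This is only a sketch: the helpers signature_arb and simple_lesk_arb are my own names, and building the signature from related synsets' Arabic lemmas is an assumption, not something NLTK provides out of the box.

# Requires the 'wordnet' corpus plus the Open Multilingual Wordnet data
# (nltk.download('omw') or nltk.download('omw-1.4'), depending on the NLTK version).
from nltk.corpus import wordnet as wn

def signature_arb(synset):
    # Arabic "signature" of a synset: its own Arabic lemma names plus
    # those of its hypernyms and hyponyms.
    words = set(synset.lemma_names(lang='arb'))
    for related in synset.hypernyms() + synset.hyponyms():
        words.update(related.lemma_names(lang='arb'))
    return words

def simple_lesk_arb(context_tokens, ambiguous_word, pos=None):
    # Return the candidate synset whose Arabic signature overlaps most with
    # the (already tokenized) Arabic context, or None if the word is not in
    # the Open Multilingual Wordnet.
    candidates = wn.synsets(ambiguous_word, pos=pos, lang='arb')
    if not candidates:
        return None
    context = set(context_tokens)
    return max(candidates, key=lambda ss: len(context & signature_arb(ss)))

You tokenize the Arabic query yourself, call simple_lesk_arb(tokens, word, pos='v'), and then map whatever synset comes back to its English lemma_names() or definition() to get the concept.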
But as you can see, there are many limitations: the glosses and examples in NLTK's WordNet are English only (the Open Multilingual Wordnet only adds Arabic lemma names), so the classic Lesk overlap against the definition has almost nothing to work with; Arabic coverage in the Open Multilingual Wordnet is much sparser than English, so many words simply are not there; and the lemma names come back fully diacritized (e.g. أَوْدَعَ), while ordinary Arabic text usually is not, so tokens from a real query may not match anything without normalization.