I have a database and an API for a Hindi wordnet. I want to access this wordnet from NLTK in Python so that I can use NLTK's WordNet functions with it. Is there any way to add our own wordnet to NLTK? Alternatively, are there any Word Sense Disambiguation tools for Hindi (ones that can work with any language's wordnet after some modification) that return the most suitable sense from the wordnet?

There is 1 answer

Everst On BEST ANSWER

If you look in your nltk_data folder, you'll see that wordnet, like every other NLTK corpus, is just a bunch of plain-text files. So it should be possible to format your Hindi wordnet the same way as the NLTK one and then use the same functions. Here is the excerpt from the reader class in nltk.corpus.reader.wordnet where these files are read:

#: A list of file identifiers for all the fileids used by this
#: corpus reader.
_FILES = ('cntlist.rev', 'lexnames', 'index.sense',
          'index.adj', 'index.adv', 'index.noun', 'index.verb',
          'data.adj', 'data.adv', 'data.noun', 'data.verb',
          'adj.exc', 'adv.exc', 'noun.exc', 'verb.exc', )

def __init__(self, root):
    """
    Construct a new wordnet corpus reader, with the given root
    directory.
    """
    super(WordNetCorpusReader, self).__init__(root, self._FILES,
                                              encoding=self._ENCODING)
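
As a quick sanity check before pointing a reader at your own data, you could verify that a directory contains every file named in that tuple. This is only a minimal sketch using the standard library: the helper name missing_wordnet_files and the directory name hindi_wordnet are hypothetical, and the file list is copied from the excerpt above, so it may differ in other NLTK versions.

```python
from pathlib import Path

# File names copied from the _FILES tuple quoted above; other NLTK
# versions may expect a different set.
WORDNET_FILES = (
    'cntlist.rev', 'lexnames', 'index.sense',
    'index.adj', 'index.adv', 'index.noun', 'index.verb',
    'data.adj', 'data.adv', 'data.noun', 'data.verb',
    'adj.exc', 'adv.exc', 'noun.exc', 'verb.exc',
)


def missing_wordnet_files(root):
    """Return the expected file names that are absent from `root`.

    Hypothetical helper, not part of NLTK.
    """
    root = Path(root)
    return [name for name in WORDNET_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "hindi_wordnet" is an assumed location for the converted files.
    print(missing_wordnet_files("hindi_wordnet"))
```

An empty list means every file in the tuple is present; it says nothing about whether the contents are formatted correctly.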

I suppose you don't really need to generate all of these files, but you will need the "index.sense" file for Word Sense Disambiguation. NLTK does not generate it: it either has to be produced in a pre-processing step or must ship with your Hindi wordnet, in the format described at http://wordnet.princeton.edu/wordnet/man/senseidx.5WN.html.
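
To make the sense index format concrete, here is a rough sketch of a parser for one line. Per the senseidx page linked above, each line has the fields sense_key, synset_offset, sense_number and tag_cnt, and a sense key looks like lemma%ss_type:lex_filenum:lex_id:head_word:head_id. The names SenseEntry and parse_sense_line are my own, and the sample line is made up for illustration rather than taken from any real wordnet.

```python
from typing import NamedTuple

# ss_type codes as listed in the senseidx(5WN) format description.
SS_TYPES = {
    1: "noun",
    2: "verb",
    3: "adjective",
    4: "adverb",
    5: "adjective satellite",
}


class SenseEntry(NamedTuple):
    """One parsed index.sense line (hypothetical container, not an NLTK type)."""
    lemma: str
    ss_type: str
    lex_filenum: int
    lex_id: int
    synset_offset: int
    sense_number: int
    tag_cnt: int


def parse_sense_line(line):
    """Split 'sense_key synset_offset sense_number tag_cnt' into its fields."""
    sense_key, offset, sense_number, tag_cnt = line.split()
    # sense_key = lemma%ss_type:lex_filenum:lex_id:head_word:head_id
    lemma, lex_sense = sense_key.rsplit("%", 1)
    ss_type, lex_filenum, lex_id, _head_word, _head_id = lex_sense.split(":")
    return SenseEntry(
        lemma=lemma,
        ss_type=SS_TYPES[int(ss_type)],
        lex_filenum=int(lex_filenum),
        lex_id=int(lex_id),
        synset_offset=int(offset),
        sense_number=int(sense_number),
        tag_cnt=int(tag_cnt),
    )


if __name__ == "__main__":
    # Made-up example line; the offset and counts are illustrative only.
    print(parse_sense_line("kutta%1:05:00:: 00012345 1 0"))
```

Running the same function over every line of your generated file is a cheap way to catch malformed entries before NLTK tries to read them.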

After you've done all of these steps, I would go to ../nltk/corpus/reader/wordnet.py and either create a copy of it, in which you change the root, the file names and maybe some other dependencies while keeping the functionality, or change what you need within the existing classes (not recommended).

P.S. A little googling turned up http://www.cs.utexas.edu/~rashish/cs365ppt.pdf, which references a bunch of other sources on the subject.