How to train a POS tagger on an input file containing lines of text with NLTK in Python


I need help training on a data set so that text can then be tokenized and tagged with a POS tagger. My input file is kon_set1.txt, which contains text in Konkani (an Indian language):

ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा.
आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.
त्यो दांत बुरशे आनी स्वास घाणयारो करतात.
हांगा दिल्ल्या कांय सोंप्या सुचोवण्यांच्या आदारान तुमी तुमचे दांत नितळ आनी स्वास ताजो दवरूंक शकतात.

I would like to know how to train on this data set so that I can later use the trained model to tokenize and tag text with a POS tagger. Thank you. I look forward to a positive response.


There are 4 answers

1
lenz On

You have two possibilities:

  1. You manually annotate a (preferably large) portion of text with PoS tags. Then you can train a tagger. This is called supervised training. You might need to revise the tagset first, though, since the English tagset might not be suitable for Konkani. And manual annotation is a time-consuming task.

  2. Contrary to the comment by @Riyaz, it is indeed possible to do some kind of PoS tagging in an unsupervised way, i.e. without labeled data (just raw text). See for example this 2009 paper by Chris Biemann for an application to English texts. This is, however, going to be much less accurate than supervised training, and you need a hell of a lot of text: Biemann suggests 50 million tokens for obtaining reasonable results.
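
To make the supervised option concrete, here is a minimal sketch of what "training on annotated text" means. It is a plain-Python most-frequent-tag baseline, not NLTK's implementation, and the tags below are illustrative guesses rather than verified Konkani annotations. The names `train_sents`, `train_baseline_tagger` and `tag` are made up for this example. NLTK's trainable taggers take training data in the same shape: a list of sentences, each a list of (word, tag) pairs.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated sentences: lists of (word, tag) pairs.
# The tags are illustrative guesses, not verified Konkani annotations.
train_sents = [
    [("ताजो", "ADJ"), ("स्वास", "NOUN"), ("आनी", "CONJ"), ("दांत", "NOUN")],
    [("दांत", "NOUN"), ("आनी", "CONJ"), ("स्वास", "NOUN")],
]


def train_baseline_tagger(tagged_sents):
    """Learn each word's most frequent tag from the annotated sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, word_tag in sent:
            counts[word][word_tag] += 1
    # Keep only the single most frequent tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, tokens, default=None):
    """Tag tokens with the learned model; unseen words get `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]


model = train_baseline_tagger(train_sents)
tagged = tag(model, "दांत आनी स्वास ताजो".split())
print(tagged)
```

Words that never occurred in the annotated data come back with the default tag, which is one reason the annotated portion should be large.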

0
alexis On

Konkani is not such an obscure language. If your goal is to train a tagger, find a tagged corpus to use as training materials. If your goal is to tag your own text, do the same or look for a pre-trained tagger. Googling "Konkani trained corpus" gives a ton of hits. Look through them.

Note the terminology: You train a tagger. You tag or annotate a corpus (by hand or by tool).

You could hand-annotate your corpus, as @Lenz suggests, but I wouldn't recommend it. Annotating a corpus of sufficient size to train a tagger is a huge undertaking.

I also wouldn't advise you to try to devise an unsupervised method, because (a) this is an open research problem and (b) you're having enough trouble with the simple stuff. So first things first: Find yourself a tagged corpus.

0
Ashay Naik On

Thank you very much for the suggestions. It worked out for us using the TnT tagger. We defined a corpus named konkani.pos and included it in the Indian corpus folder. Now we are able to fetch the lines of data from the training set and test the tagger via the KonkaniTest.text file.

1
Ashay Naik On

Now, when I compute the frequency of occurrence of the tagged words using

x = FreqDist(train_data)

and

print(x)

the output shows only a few tagged words followed by "...", so not all tagged words are listed. How can I see all the tagged words? Also, len(x) gives the number of distinct tagged words.
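
One possible way to list every entry, sketched under some assumptions: in NLTK 3, FreqDist is a subclass of collections.Counter, so most_common() with no argument returns all (item, count) pairs instead of the abbreviated summary that print(x) gives. The sample data and its tags are hypothetical, and the sketch assumes the training data is flattened into (word, tag) pairs first, since a list of whole sentences (lists) cannot be counted directly. The fallback import only exists so the sketch also runs where NLTK is not installed.

```python
try:
    from nltk import FreqDist  # a subclass of collections.Counter in NLTK 3
except ImportError:
    # Fallback so the sketch still runs without NLTK installed.
    from collections import Counter as FreqDist

# Hypothetical tagged sentences; the tags are illustrative guesses.
train_sents = [
    [("ताजो", "ADJ"), ("स्वास", "NOUN"), ("आनी", "CONJ"), ("दांत", "NOUN")],
    [("दांत", "NOUN"), ("आनी", "CONJ"), ("स्वास", "NOUN")],
]

# Flatten the sentences into one list of (word, tag) pairs.
tagged_words = [pair for sent in train_sents for pair in sent]

x = FreqDist(tagged_words)

# most_common() without an argument returns every entry, most frequent first.
for (word, word_tag), count in x.most_common():
    print(word, word_tag, count)

distinct = len(x)        # number of distinct (word, tag) pairs
total = sum(x.values())  # total number of tagged tokens counted
print(distinct, total)
```

Iterating over x.items() would also visit every entry, just without the ordering by frequency.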