I need help training a POS tagger on my own data set, so that I can then tokenize and tag text with it. My input file is kon_set1.txt, which contains text in Konkani (an Indian language):
ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा.
आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.
त्यो दांत बुरशे आनी स्वास घाणयारो करतात.
हांगा दिल्ल्या कांय सोंप्या सुचोवण्यांच्या आदारान तुमी तुमचे दांत नितळ आनी स्वास ताजो दवरूंक शकतात.
I would like to know how to train on this data set, so that I can later use the trained model to tokenize and POS-tag text. Thank you in advance.
You have two possibilities:
You manually annotate a (preferably large) portion of the text with PoS tags and then train a tagger on it. This is called supervised training. You may need to revise the tagset first, though, since the English tagset might not be suitable for Konkani. Manual annotation is also a time-consuming task.
Contrary to the comment by @Riyaz, it is indeed possible to do some kind of PoS tagging in an unsupervised way, i.e. without labeled data (just raw text). See, for example, this 2009 paper by Chris Biemann for an application to English texts. However, this is going to be much less accurate than supervised training, and you need a very large amount of text: Biemann suggests 50 million tokens for obtaining reasonable results.
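To make the first (supervised) option more concrete, here is a minimal sketch of the simplest possible trainable tagger: a most-frequent-tag baseline written in plain Python. It is not a recommendation of a specific library or model. The function names (`train_unigram_tagger`, `tokenize`, `tag`) are made up for this example, the tiny training set consists of hand-tagged fragments of your sample text, and the tags are my own illustrative guesses using Universal-POS-style labels, not an authoritative Konkani annotation.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated training data: fragments of the sample text,
# tagged with illustrative Universal-POS-style labels (an assumption, not
# a verified Konkani tagset).
TRAIN = [
    [("ताजो", "ADJ"), ("स्वास", "NOUN"), ("आनी", "CCONJ"),
     ("चकचकीत", "ADJ"), ("दांत", "NOUN")],
    [("दांत", "NOUN"), ("आनी", "CCONJ"), ("स्वास", "NOUN")],
]

def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each word, plus a fallback tag."""
    counts = defaultdict(Counter)   # word -> Counter of tags seen with it
    tag_totals = Counter()          # overall tag frequencies
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word][pos] += 1
            tag_totals[pos] += 1
    model = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]
    return model, default_tag

def tokenize(text):
    """Naive tokenizer: split on whitespace, strip surrounding punctuation."""
    punctuation = ".,!?।"
    tokens = (t.strip(punctuation) for t in text.split())
    return [t for t in tokens if t]

def tag(tokens, model, default_tag):
    """Tag known words from the model; unknown words get the fallback tag."""
    return [(tok, model.get(tok, default_tag)) for tok in tokens]

model, default_tag = train_unigram_tagger(TRAIN)
result = tag(tokenize("त्यो दांत बुरशे आनी स्वास घाणयारो करतात."), model, default_tag)
for token, pos in result:
    print(token, pos)
```

Words never seen in training simply receive the most common tag, which will often be wrong. That is why the amount of annotated text matters so much. A real tagger would also use context (for example an HMM or CRF over tag sequences), but the workflow is the same: annotate, train, then tokenize and tag new text.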