I have numerous plain text files (.txt). For my project I need to load them with the tagged corpus reader and assign categories. For that:
First, I need these files to be tagged with the POS of each word.
Is there a library, or some other way, to do this without writing code myself that iterates over each word, finds its POS, and appends it after the word with a '/'? When I do it manually, the separate paragraphs all end up merged into a single paragraph.
Here is the code I have written to generate a text file in which each word carries its POS tag (for instance, public becomes public/JJ):

    from nltk.corpus import PlaintextCorpusReader
    from nltk.tokenize import word_tokenize
    import nltk


    class CorpusInitialize:
        def loading_textfile(self):
            corpus_root = 'C:/Users/nkumarn/PycharmProjects/Corpus1/'
            wordlists = PlaintextCorpusReader(corpus_root, '.*')
            files = wordlists.fileids()
            for eachfile in files:
                # paras() gives paragraphs -> sentences -> words
                textfile = wordlists.paras(fileids=eachfile)
                text = self.set_paragraphs(textfile)
                self.write_to_textfile(text, eachfile)

        def set_paragraphs(self, textfile):
            # Rebuild the raw text, marking the end of each paragraph with '~'
            new_text = ""
            flag = 0
            for all_paras in textfile:
                for every_para in all_paras:
                    if every_para != " ":
                        for every_word in every_para:
                            if new_text == "":
                                new_text = every_word
                            elif every_word == '.' or every_word == '?' or every_word == ',' or every_word == '!':
                                # punctuation is attached without a leading space
                                new_text = new_text + every_word
                            elif every_word == '@':
                                flag = 1
                                new_text = new_text + " " + every_word
                            else:
                                if flag == 1:
                                    new_text = new_text + every_word
                                else:
                                    new_text = new_text + " " + every_word
                if new_text != "":
                    new_text = new_text + " " + '~'
            text = self.create_corpos(new_text)
            return text

        def create_corpos(self, new_text):
            words = word_tokenize(new_text)
            all_pos_words = nltk.pos_tag(words)
            text = ""
            for every_pos_words in all_pos_words:
                # '~' is the paragraph marker: turn it back into a line break
                if every_pos_words[0] == '~':
                    text = text + '\n'
                    continue
                if text == "":
                    text = every_pos_words[0] + '/' + every_pos_words[1] + ' '
                else:
                    text = text + every_pos_words[0] + '/' + every_pos_words[1] + ' '
            return text

        def write_to_textfile(self, textfile, fileid):
            file = open("C:/Users/nkumarn/PycharmProjects/taggedcorpus/%s" % (fileid,), "w")
            file.write(textfile)
            file.close()

The input is a plain text file, for example:
"""Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. Unsupervised learning is closely related to the problem of density estimation in statistics.[1] However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data."""
and the output looks like this:
Unsupervised/VBN learning/NN is/VBZ the/DT machine/NN learning/VBG task/NN of/IN inferring/VBG a/DT function/NN to/TO describe/VB hidden/JJ structure/NN from/IN unlabeled/JJ data/NNS ./. Since/IN the/DT examples/NNS given/VBN to/TO the/DT learner/NN are/VBP unlabeled/VBN ,/, there/EX is/VBZ no/DT error/NN or/CC reward/JJ signal/NN to/TO evaluate/VB a/DT potential/JJ solution/NN ./. Unsupervised/VBN learning/NN is/VBZ closely/RB related/VBN to/TO the/DT problem/NN of/IN density/NN estimation/NN in/IN statistics/NNS ./. [/$ 1/CD ]/NNP However/RB ,/, unsupervised/JJ learning/NN also/RB encompasses/VBZ many/JJ other/JJ techniques/NNS that/WDT seek/VBP to/TO summarize/VB and/CC explain/VB key/JJ features/NNS of/IN the/DT data/NN ./.
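A minimal sketch of one way the word/TAG format above could be produced while keeping paragraph breaks. It assumes the input has the same nesting that `PlaintextCorpusReader.paras()` returns (paragraphs, then sentences, then words). The function name `format_tagged_paragraphs` and its `tagger` parameter are hypothetical; in real use `nltk.pos_tag` could be passed as `tagger`, while the stub tagger below only keeps the example independent of any tagger model. It writes one sentence per line and a blank line between paragraphs, which is the layout `TaggedCorpusReader` reads with its default settings.

```python
from typing import Callable, Iterable, List, Tuple

TaggedToken = Tuple[str, str]


def format_tagged_paragraphs(
    paragraphs: Iterable[Iterable[List[str]]],
    tagger: Callable[[List[str]], List[TaggedToken]],
    sep: str = "/",
) -> str:
    """Render paragraphs (lists of tokenized sentences) as word/TAG text.

    Hypothetical helper: one sentence per line, one blank line between
    paragraphs. `tagger` is any callable mapping a token list to
    (word, tag) pairs, for example nltk.pos_tag.
    """
    blocks = []
    for paragraph in paragraphs:
        lines = []
        for sentence in paragraph:
            tagged = tagger(list(sentence))
            lines.append(" ".join(f"{word}{sep}{tag}" for word, tag in tagged))
        if lines:  # skip empty paragraphs
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + ("\n" if blocks else "")


def stub_tagger(tokens: List[str]) -> List[TaggedToken]:
    """Stand-in for a real tagger (assumption for illustration only)."""
    return [(token, "NN") for token in tokens]


if __name__ == "__main__":
    paragraphs = [
        [["Unsupervised", "learning", "."], ["It", "uses", "unlabeled", "data", "."]],
        [["A", "second", "paragraph", "."]],
    ]
    print(format_tagged_paragraphs(paragraphs, stub_tagger))
```

With the code from the question, the call would be roughly `format_tagged_paragraphs(wordlists.paras(fileids=eachfile), nltk.pos_tag)`. Because the paragraphs are never flattened into one string, no '~' marker is needed.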
So, coming back to the question: is there a library, or a simpler yet effective way, to produce this output? I am trying to get output similar to the Brown corpus so that I can make use of all the TaggedCorpusReader functions.
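A hedged sketch of how word/TAG files could then be read back with categories, using NLTK's `CategorizedTaggedCorpusReader`. The folder layout is an assumption: every file sits in a subfolder of the corpus root, and the subfolder name is taken as its category. The helper names `load_tagged_corpus`, `build_demo_corpus` and `demo`, and the two hand-tagged sample files, are made up for illustration.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedTaggedCorpusReader


def load_tagged_corpus(root: str) -> CategorizedTaggedCorpusReader:
    """Open word/TAG files under `root`, using each subfolder as a category."""
    return CategorizedTaggedCorpusReader(
        root,
        r".*\.txt",                 # every .txt file below root
        cat_pattern=r"([^/]+)/.*",  # category = first path component
        sep="/",                    # separator between word and tag
    )


def build_demo_corpus(root: str) -> None:
    """Write two tiny hand-tagged files (illustrative data, not real output)."""
    samples = {
        "ml/doc1.txt": (
            "Unsupervised/JJ learning/NN ./.\n"
            "Data/NNS are/VBP unlabeled/JJ ./.\n"
            "\n"
            "It/PRP works/VBZ ./.\n"
        ),
        "news/doc2.txt": "Markets/NNS rose/VBD ./.\n",
    }
    for fileid, text in samples.items():
        path = os.path.join(root, *fileid.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf8") as handle:
            handle.write(text)


def demo() -> dict:
    """Build the demo corpus in a temporary folder and query it."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_corpus(root)
        reader = load_tagged_corpus(root)
        return {
            "categories": reader.categories(),
            "ml_words": list(reader.tagged_words(categories="ml")),
            "ml_paragraphs": len(list(reader.tagged_paras(categories="ml"))),
        }


if __name__ == "__main__":
    print(demo())
```

With its default settings the reader treats each line as a sentence and each blank-line-separated block as a paragraph, so files with one paragraph per line would come back as one long sentence per paragraph.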
I have also worked with the plain text corpus reader, but that is not what I need for my project right now. Please help me find a solution. I suspect there is a way that I am missing.