How to create tagged corpus text files


I have numerous plain text files (.txt). I need to use the tagged corpus reader and have categories for my project. For that:


First, I need these files to be tagged with the POS of each word.

Is there a library, or some other way, to do this without writing the code myself to iterate over each word, find its POS, and append it to the word with a '/'? When I do it manually, the separate paragraphs all end up merged into a single paragraph.


Here is the code that I have written to generate a text file in which each word is followed by its POS (for instance, public becomes public/JJ):

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import word_tokenize

class CorpusInitialize:

    def loading_textfile(self):

        corpus_root = 'C:/Users/nkumarn/PycharmProjects/Corpus1/'
        wordlists = PlaintextCorpusReader(corpus_root,'.*')
        files=wordlists.fileids()
        for eachfile in files:
            textfile = wordlists.paras(fileids=eachfile)
            text=self.set_paragraphs(textfile)
            self.write_to_textfile(text,eachfile)



    def set_paragraphs(self, textfile):
        new_text = ""
        flag = 0
        for all_paras in textfile:
            for every_para in all_paras:
                if every_para != " ":
                    for every_word in every_para:
                        if new_text == "":
                            new_text = every_word
                        elif every_word == '.' or every_word == '?' or every_word == ',' or every_word == '!':
                            new_text = new_text + every_word
                        elif every_word == '@':
                            flag = 1
                            new_text = new_text + " " + every_word
                        else:
                            if flag == 1:
                                new_text = new_text + every_word
                            else:
                                new_text = new_text + " " + every_word

            if new_text != "":
                new_text = new_text + " " + '~'
        text= self.create_corpos(new_text)
        return text


    def create_corpos(self,new_text):
        words = word_tokenize(new_text)
        all_pos_words = nltk.pos_tag(words)
        text=""
        for every_pos_words in all_pos_words:
            if every_pos_words[0] == '~':
                text = text + '\n'
                continue

            if text == "":
                text = every_pos_words[0] + '/' + every_pos_words[1] + ' '
            else:
                text = text + every_pos_words[0] + '/' + every_pos_words[1] + ' '


        return text


    def write_to_textfile(self,textfile,fileid):
        # Use a context manager so the file is closed even if the write fails.
        with open("C:/Users/nkumarn/PycharmProjects/taggedcorpus/%s"%(fileid,), "w") as file:
            file.write(textfile)

The input is a plain text file. For example:

"""Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. Unsupervised learning is closely related to the problem of density estimation in statistics.[1] However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data."""

and the output file looks like this:

Unsupervised/VBN learning/NN is/VBZ the/DT machine/NN learning/VBG task/NN of/IN inferring/VBG a/DT function/NN to/TO describe/VB hidden/JJ structure/NN from/IN unlabeled/JJ data/NNS ./. Since/IN the/DT examples/NNS given/VBN to/TO the/DT learner/NN are/VBP unlabeled/VBN ,/, there/EX is/VBZ no/DT error/NN or/CC reward/JJ signal/NN to/TO evaluate/VB a/DT potential/JJ solution/NN ./. Unsupervised/VBN learning/NN is/VBZ closely/RB related/VBN to/TO the/DT problem/NN of/IN density/NN estimation/NN in/IN statistics/NNS ./. [/$ 1/CD ]/NNP However/RB ,/, unsupervised/JJ learning/NN also/RB encompasses/VBZ many/JJ other/JJ techniques/NNS that/WDT seek/VBP to/TO summarize/VB and/CC explain/VB key/JJ features/NNS of/IN the/DT data/NN ./.

So, coming back to the question: is there any library, or a simpler yet effective way, to produce this output? I am trying to get output similar to the Brown corpus so that I can use all of the TaggedCorpusReader functions.

I have also worked with the plain text corpus reader, but that is not what I need for my project right now. Please help me find a solution. I expect there is a way that I am missing.
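For reference, here is a minimal sketch of the kind of shorter pipeline being asked about. It is not a confirmed solution. It assumes that `PlaintextCorpusReader.paras()` keeps the paragraph and sentence structure, that `nltk.pos_tag_sents` can tag each paragraph's sentences, and that the NLTK tokenizer and tagger models have already been downloaded. The helper names `format_tagged_paragraphs` and `tag_corpus` are hypothetical, chosen only for illustration. The output puts one sentence per line with a blank line between paragraphs, so paragraphs are not merged.

```python
from pathlib import Path


def format_tagged_paragraphs(tagged_paras, sep="/"):
    """Render tagged paragraphs as word/TAG text.

    tagged_paras is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of (word, tag) pairs.
    Output: one sentence per line, a blank line between paragraphs.
    """
    blocks = []
    for para in tagged_paras:
        lines = [" ".join(f"{word}{sep}{tag}" for word, tag in sent) for sent in para]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def tag_corpus(src_root, dst_root, fileids=r".*\.txt"):
    """Tag every plain text file under src_root and write it under dst_root.

    Hypothetical helper: assumes the NLTK tokenizer and tagger models
    have been fetched beforehand with nltk.download().
    """
    # Imported lazily so the formatter above works without NLTK installed.
    import nltk
    from nltk.corpus import PlaintextCorpusReader

    reader = PlaintextCorpusReader(src_root, fileids)
    for fileid in reader.fileids():
        # paras() returns paragraphs -> sentences -> words.
        tagged = [nltk.pos_tag_sents(para) for para in reader.paras(fileid)]
        out_path = Path(dst_root) / fileid
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(format_tagged_paragraphs(tagged), encoding="utf-8")
```

If this layout is right, the output directory should be readable with `nltk.corpus.reader.TaggedCorpusReader(dst_root, r".*\.txt")`, whose defaults expect '/' as the separator, one sentence per line, and blank lines between paragraphs. For categories, `CategorizedTaggedCorpusReader` with a `cat_pattern` argument looks like the matching reader, though I have not verified that against my files.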


There are 0 answers