Can I convert a 3-gram txt file to IOB format for CRFsuite?

The txt file contains 3-grams in this format:

None,None,kgo,gop,ope,Test_Sepedi
None,kgo,gop,ope,pel,Test_Sepedi
kgo,gop,ope,pel,elo,Test_Sepedi
gop,ope,pel,elo,None,Test_Sepedi
ope,pel,elo,None,None,Test_Sepedi
None,None,gag,ago,None,Test_Sepedi
None,gag,ago,None,None,Test_Sepedi
None,None,gan,ann,nnw,Test_Sepedi
None,gan,ann,nnw,nwe,Test_Sepedi
gan,ann,nnw,nwe,None,Test_Sepedi
ann,nnw,nwe,None,None,Test_Sepedi
None,None,tla,None,None,Test_Sepedi

I want it in a format that CRFsuite will accept for training, for example:

London JJ B-NP
shares NNS I-NP
closed VBD B-VP
moderately RB B-ADVP
lower JJR I-ADVP
in IN B-PP
thin JJ B-NP
trading NN I-NP

A way to convert it using Python would be highly appreciated.

There are 2 answers

2
Avi On BEST ANSWER

Judging by the question, I assume the input file is in CSV format and that the target IOB2 format uses space- or tab-separated tokens. The simplest way to get that layout is to read each line and replace the comma delimiter with a space.



    # Placeholder paths: fill in your own, do not copy and paste as-is.
    INPUT_PATH = 'input.txt'
    OUTFILE_PATH = 'output.txt'

    # Read all input lines first, then close the input file.
    with open(INPUT_PATH, 'r') as in_file:
        data = in_file.readlines()

    with open(OUTFILE_PATH, 'w') as out_file:
        for line in data:
            output_line = line.replace("\n", "")
            # If the format requires a space, replace with a space.
            # If the format requires a tab, replace with a tab.
            # Your file seems to be comma separated,
            # which is why I replace the comma below with a space.
            output_line = output_line.replace(",", " ")
            out_file.write(output_line + '\n')
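
The same comma-to-space replacement can also be sketched with Python's standard `csv` module, which makes the output delimiter easy to switch between a space and a tab. This is only a minimal sketch: the function name `convert_rows` and its `out_delimiter` parameter are made up for illustration, and the sample rows are copied from the question.

```python
import csv
import io


def convert_rows(src, dst, out_delimiter=" "):
    """Copy comma-separated rows from src to dst, joined by out_delimiter.

    src and dst are file-like objects (hypothetical helper, not part of CRFsuite).
    """
    for row in csv.reader(src):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        dst.write(out_delimiter.join(field.strip() for field in row) + "\n")


# Two rows taken from the question, held in memory instead of a real file.
sample = io.StringIO(
    "None,None,kgo,gop,ope,Test_Sepedi\n"
    "None,kgo,gop,ope,pel,Test_Sepedi\n"
)
out = io.StringIO()
convert_rows(sample, out)
print(out.getvalue(), end="")
```

With real files, the same function could be called with two objects returned by `open(...)`, passing `out_delimiter="\t"` if the target format needs tabs.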

Hope this helps!

0
jack On

I can't quite see what you are trying to do, so I'll just give you my thoughts:

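with open('./in') as in_file, open('./out', 'w') as out_file:
    for line in in_file:
        # Do whatever you want with the input line here.
        # As a placeholder, this just strips the trailing newline.
        result = line.rstrip('\n')
        # Write the result to the output file.
        out_file.write(result + '\n')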
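
As one possible way to fill in the per-line step: as far as I understand CRFsuite's documented data format, each line holds one item, with the label first and the attributes after it, all separated by tabs. The sketch below assumes that the last column of each row is the label, that the five n-gram columns are a window centered on the third column, and that `None` is padding that can be dropped. The function name `to_crfsuite_item` and the `g[offset]=` attribute prefix are made up for illustration.

```python
def to_crfsuite_item(line):
    """Turn one comma-separated row into a label-first, tab-separated item.

    Assumptions (not confirmed by the question): the last field is the label,
    the remaining fields are a centered n-gram window, "None" is padding.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    *grams, label = fields
    center = len(grams) // 2
    # Name each attribute by its offset from the center of the window.
    attrs = [
        f"g[{i - center}]={gram}"
        for i, gram in enumerate(grams)
        if gram != "None"
    ]
    return "\t".join([label] + attrs)


print(to_crfsuite_item("None,None,kgo,gop,ope,Test_Sepedi"))
```

Note that a colon inside an attribute name has a special meaning to CRFsuite (it separates the name from a weight), so n-grams containing `:` would need escaping; the sample rows in the question contain none.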

Hope this is helpful.