How can I convert a list of sentences to IOB format while keeping the sentence separation in the output?


I have a txt file that I need to convert to IOB format for a CRF model.

Using NLTK's tree2conlltags, I can convert tokenized, POS-tagged text into the IOB format that I need.

Like this:

("u'Is", 'JJ', u'O')
('Miami', 'NNP', u'B-PERSON')
('playing', 'NN', u'O')
('in', 'IN', u'O')
('Washigthon', 'NNP', u'B-GPE')
('this', 'DT', u'O')
('month', 'NN', u'O')
('?', '.', u'O')

But the problem is that the output has one word per element, whereas I need one sentence per element.

I also tried splitting the text into sentences first and then tokenizing them, so that the sentence boundaries are kept, but the NLTK POS tagger doesn't accept list-type data.

Maybe there is a whole new approach to get the format I need or


There is 1 answer

Answered by lenz

It's easy to concatenate the tokens, PoS tags, and NER labels into one string each for every sentence, e.g. like this (token_wise is the data from your example):

>>> tuple(' '.join(layer) for layer in zip(*token_wise))
("u'Is Miami playing in Washigthon this month ?",
 'JJ NNP NN IN NNP DT NN .',
 'O B-PERSON O O B-GPE O O O')

You'd have to repeat that for each sentence. But it doesn't make any sense: your CRF tagger will have no chance of predicting a complex label like 'O B-PERSON O O B-GPE O O O', because you'll have a huge sparse-data problem. Most labels will be seen only once, and that holds even more strongly for the input sentences.
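To make the sparse-data point concrete, here is a minimal sketch that counts how often each label occurs under the two schemes. The `sentences` list is hypothetical toy data (not from the question), and the names `joined_labels` and `token_labels` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: each sentence is a list of (token, pos, iob) triples.
sentences = [
    [("Is", "VBZ", "O"), ("Miami", "NNP", "B-GPE"),
     ("playing", "VBG", "O"), ("?", ".", "O")],
    [("Anna", "NNP", "B-PERSON"), ("lives", "VBZ", "O"), ("in", "IN", "O"),
     ("Boston", "NNP", "B-GPE"), (".", ".", "O")],
    [("He", "PRP", "O"), ("left", "VBD", "O"), (".", ".", "O")],
]

# Scheme 1: one joined label string per sentence (the zip/join approach above).
joined_labels = Counter(
    " ".join(iob for _, _, iob in sent) for sent in sentences
)

# Scheme 2: one label per token (ordinary IOB tagging).
token_labels = Counter(iob for sent in sentences for _, _, iob in sent)

# How often was the most frequent label of each scheme observed?
print("sentence-level:", joined_labels.most_common(1))
print("token-level:   ", token_labels.most_common(1))
```

On this toy data every joined label occurs exactly once, while the token-level labels are reused across sentences, which is what lets a sequence model learn anything from them.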

Also, this is not IOB format. In IOB, you have either I, O, or B per element, but not a combination of them.
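A more conventional way to keep the sentence boundaries is probably to keep one label per token and make each sentence its own list, so that the sentence is the element and the tokens are its items. The sketch below assumes you already have one list of (token, pos, iob) triples per sentence, for instance from running tree2conlltags on each sentence separately; the data and the helper names `to_conll` and `split_features_labels` are hypothetical.

```python
# Hypothetical input: one list of (token, pos, iob) triples per sentence.
sentences = [
    [("Is", "VBZ", "O"), ("Miami", "NNP", "B-GPE"),
     ("playing", "VBG", "O"), ("?", ".", "O")],
    [("He", "PRP", "O"), ("left", "VBD", "O"), (".", ".", "O")],
]


def to_conll(sents):
    """Render CoNLL-style text: one token per line, blank line between sentences."""
    blocks = ["\n".join(" ".join(triple) for triple in sent) for sent in sents]
    return "\n\n".join(blocks) + "\n"


def split_features_labels(sents):
    """Return parallel per-sentence lists of (token, pos) inputs and IOB labels."""
    X = [[(tok, pos) for tok, pos, _ in sent] for sent in sents]
    y = [[iob for _, _, iob in sent] for sent in sents]
    return X, y


X, y = split_features_labels(sentences)
conll_text = to_conll(sentences)
print(conll_text)
```

Here `X[i]` and `y[i]` belong to sentence `i` and have the same length, which is the nested shape that sequence-labelling toolkits typically expect; the blank lines in `conll_text` carry the same boundary information in file form.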