I have some text files which I need to convert to IOB format for a CRF model.
Using nltk's tree2conlltags I can convert tokenized, PoS-tagged text into the IOB format that I need.
Like this:
("u'Is", 'JJ', u'O')
('Miami', 'NNP', u'B-PERSON')
('playing', 'NN', u'O')
('in', 'IN', u'O')
('Washigthon', 'NNP', u'B-GPE')
('this', 'DT', u'O')
('month', 'NN', u'O')
('?', '.', u'O')
But the problem is that in the output I get one word per element, whereas I need one sentence per element.
I also tried first splitting the text into sentences and then tokenizing them, so that I keep the sentence boundaries, but the nltk PoS tagger doesn't accept list-type data.
Maybe there is a whole new approach to get the format I need?
It's easy to concatenate the tokens, PoS tags and NER labels into one string each for every sentence (say, with token_wise holding the data from your example). You'd have to repeat that for each sentence.

But it doesn't make any sense. Your CRF tagger will have no chance to predict a complex label like 'O B-PERSON O O B-GPE O O O', because you'll have a huge sparse-data problem. Most labels will only be seen once, and that is even more true of the input sentences.

Also, this is not IOB format. In IOB, each element carries exactly one tag (I, O, or B), not a combination of them.
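For reference, the concatenation mentioned above could look roughly like the sketch below. It is only an illustration: token_wise is copied verbatim from the question's output (stray u' included), and sentence_wise is a made-up name for the result.

```python
# token_wise: the (token, PoS tag, IOB label) triples from the question,
# copied verbatim (including the stray u' on the first token).
token_wise = [
    ("u'Is", "JJ", "O"),
    ("Miami", "NNP", "B-PERSON"),
    ("playing", "NN", "O"),
    ("in", "IN", "O"),
    ("Washigthon", "NNP", "B-GPE"),
    ("this", "DT", "O"),
    ("month", "NN", "O"),
    ("?", ".", "O"),
]

# zip(*...) transposes the list of triples into three parallel columns.
tokens, pos_tags, labels = zip(*token_wise)

# Join each column with spaces: one string per column for the whole sentence.
# sentence_wise is a hypothetical name, not something nltk provides.
sentence_wise = tuple(" ".join(column) for column in (tokens, pos_tags, labels))

print(sentence_wise[2])
```

The last element of sentence_wise is exactly the kind of merged label string that causes the sparse-data problem described above. If the goal is only to keep sentence boundaries, one option is to keep a list with one such list of triples per sentence, so every token still has its own I/O/B tag.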