NLTK saving trained Brill's model

I am training a Brill POS tagger in NLTK, using the py-crfsuite based CRFTagger as the baseline tagger. However, when I try to save the trained model, I get the following error:

import pickle

import nltk
import yaml
from nltk.tag import CRFTagger

# train_sents is a list of tagged sentences: [[(word, tag), ...], ...]
crf_tagger = CRFTagger()
crf_tagger.train(train_sents, 'model_trained.crf.tagger')
templates = nltk.tag.brill.nltkdemo18()
trainer = nltk.tag.brill_trainer.BrillTaggerTrainer(crf_tagger, templates)
bt = trainer.train(train_sents, max_rules=10)

# saving with yaml fails
with open('trained_brill_tagger.yaml', 'w') as file_writing:
    yaml.dump(bt, file_writing)

# even pickle fails
with open('trained_brills.pickle', 'wb') as file_w:
    pickle.dump(bt, file_w)

File "stringsource", line 2, in pycrfsuite._pycrfsuite.Tagger.__reduce_cython__
TypeError: self.c_tagger cannot be converted to a Python object for pickling

I have tried pickle, dill, and yaml, but the error persists. Is there a solution to this? Is it caused by using the CRF tagger as the baseline? Thank you.

There are 2 answers

alvas On

Here's an example of how you can set up an nltk.tag.brill_trainer.BrillTaggerTrainer in NLTK v3.2.5:

from nltk.corpus import treebank

from nltk.tag import BrillTaggerTrainer, RegexpTagger, UnigramTagger
from nltk.tbl.demo import REGEXP_TAGGER, _demo_prepare_data
from nltk.tag.brill import describe_template_sets, brill24

baseline_backoff_tagger = REGEXP_TAGGER
templates = brill24()
tagged_data = treebank.tagged_sents()
train=0.8
trace=3
num_sents=1000
randomize=False
separate_baseline_data=False

(training_data, baseline_data, gold_data, testing_data) = \
   _demo_prepare_data(tagged_data, train, num_sents, randomize, separate_baseline_data)

baseline_tagger = UnigramTagger(baseline_data, backoff=baseline_backoff_tagger)

# creating a Brill tagger trainer (call trainer.train(...) to get the tagger itself)
trainer = BrillTaggerTrainer(baseline_tagger, templates, trace, ruleformat="str")

Then, to save the trainer, simply pickle it:

import pickle
with open('brill-demo.pkl', 'wb') as fout:
    pickle.dump(trainer, fout)
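
The snippet above pickles the trainer object; to persist the trained tagger itself, the value returned by trainer.train() is what would normally be pickled. Below is a minimal sketch of that round trip. It assumes a tiny hand-made corpus (hypothetical data, not treebank) and a pure-Python UnigramTagger baseline, so it illustrates the save/load step rather than a realistic training run.

```python
import pickle

from nltk.tag import BrillTaggerTrainer, DefaultTagger, UnigramTagger
from nltk.tag.brill import brill24

# Hypothetical toy corpus, standing in for real training data.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("I", "PRP"), ("can", "MD"), ("run", "VB")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
]

# A pure-Python baseline tagger, which pickle can serialize.
baseline_tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

# Train the Brill tagger on top of the baseline.
trainer = BrillTaggerTrainer(baseline_tagger, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)

# Round-trip the trained tagger through pickle (in memory here;
# pickle.dump / pickle.load work the same way with a file opened in binary mode).
payload = pickle.dumps(brill_tagger)
restored = pickle.loads(payload)

sentence = ["the", "cat", "barks"]
print(restored.tag(sentence))
```

With a corpus this small the trainer may learn few or no rules; the point is only that the restored tagger behaves like the original one.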
humble_fool On

I realized the issue is in the CRFTagger module. If I use a different initial tagger with the Brill trainer, the error isn't produced and the model gets saved.

trainer = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, templates)

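I was unable to save the trained model when baseline_tagger was a CRFTagger() object. Using something like an NgramTagger solves the issue, apparently because the trained Brill tagger keeps a reference to its baseline tagger, and CRFTagger wraps a pycrfsuite.Tagger C object (the self.c_tagger in the traceback) that cannot be pickled.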
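
The failure mode can be illustrated without NLTK or pycrfsuite. The sketch below uses only the standard library: NativeHandle and PicklableBaseline are hypothetical stand-ins (not NLTK classes), with a threading.Lock playing the role of the unpicklable C object. It shows one common workaround, assuming the native state can be rebuilt from a saved model path: drop the handle in __getstate__ and recreate it in __setstate__.

```python
import pickle
import threading


class NativeHandle:
    """Hypothetical stand-in for a C-backed object such as pycrfsuite.Tagger."""

    def __init__(self, model_path):
        self.model_path = model_path
        self._lock = threading.Lock()  # lock objects cannot be pickled


class PicklableBaseline:
    """Hypothetical wrapper that pickles only the model path, not the handle."""

    def __init__(self, model_path):
        self.model_path = model_path
        self._handle = NativeHandle(model_path)

    def __getstate__(self):
        # Copy the instance state and leave out the unpicklable handle.
        state = self.__dict__.copy()
        del state["_handle"]
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then rebuild the handle from the path.
        self.__dict__.update(state)
        self._handle = NativeHandle(self.model_path)


# Pickling the raw handle fails, much like the CRFTagger case.
try:
    pickle.dumps(NativeHandle("model_trained.crf.tagger"))
    raw_handle_pickled = True
except TypeError:
    raw_handle_pickled = False

# The wrapper survives a round trip because the handle is rebuilt on load.
restored = pickle.loads(pickle.dumps(PicklableBaseline("model_trained.crf.tagger")))
print(raw_handle_pickled, restored.model_path)
```

For the real CRFTagger, the rebuild step would presumably reopen the model file written by crf_tagger.train(), for example through CRFTagger's set_model_file method; that variant is not tested here.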