I trained a spancat model in spaCy, and training completed successfully. However, when I run it on test data, it doesn't make any predictions.
Here are the training results:
This is how I am doing the predictions:
for text in df['text_cleaned']:
    doc = nlp(text)
    spans = doc.spans
When I look at the spans they are all empty: https://i.stack.imgur.com/b0tGu.png
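For reference, here is a minimal sketch of how the span group stored under the configured key could be inspected per document. It assumes the key is "sc" (as in the config), uses a blank pipeline as a stand-in for the trained one, and sets one span by hand purely for illustration; `describe_spans` is a hypothetical helper, not part of spaCy.

```python
import spacy
from spacy.tokens import Span

# Stand-in for the trained pipeline (assumption: the real one is loaded elsewhere).
nlp = spacy.blank("en")
doc = nlp("Apple is a large company")

# Set a span by hand for illustration; a trained spancat would fill this itself.
doc.spans["sc"] = [Span(doc, 0, 1, label="ORG")]


def describe_spans(doc, key="sc"):
    """Return (text, label, start_token, end_token) for each span under `key`."""
    if key not in doc.spans:
        return []
    return [(s.text, s.label_, s.start, s.end) for s in doc.spans[key]]


print(describe_spans(doc))
```

Looking at `doc.spans["sc"]` directly, rather than at the whole `doc.spans` container, makes it easier to tell an empty group apart from a missing key.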
Here is the cfg file I'm using:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 444
[nlp]
lang = "en"
pipeline = ["tok2vec","spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5
[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"
[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3
[initialize.components.spancat.labels]
@readers = "spacy.read_labels.v1"
path = null
require = true
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
max_epochs = 70
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
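To make the suggester setting in the config concrete, here is a small sketch of which candidate spans an n-gram suggester with sizes = [1,2,3] proposes for a five-token sentence. It assumes `build_ngram_suggester` can be imported from `spacy.pipeline.spancat` (the function registered as "spacy.ngram_suggester.v1") and uses a blank English pipeline for tokenization.

```python
import spacy
from spacy.pipeline.spancat import build_ngram_suggester

nlp = spacy.blank("en")
doc = nlp("Apple is a large company")  # 5 tokens

# Same sizes as in the config above.
suggester = build_ngram_suggester(sizes=[1, 2, 3])
candidates = suggester([doc])

# Each row of the ragged array is a (start_token, end_token) pair.
candidate_offsets = {(int(start), int(end)) for start, end in candidates.dataXd}

for start, end in sorted(candidate_offsets):
    print(start, end, doc[start:end].text)

# A span covering all 5 tokens is longer than any configured size.
print((0, len(doc)) in candidate_offsets)
```

As far as I understand, spancat only scores spans that the suggester proposes, so any span longer than three tokens would never be a candidate with this setting.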
Originally, for my NER model, I had labeled examples in this format: ('Apple is a large company', {'entities': [(0, 5, 'ORG')]})
I fed a list of examples in this format into this function to convert them to spancat format:
import spacy
from spacy.tokens import DocBin, Span, SpanGroup

def convert_to_docbin(input, output_path="./train.spacy", lang='en'):
    """ Convert a pair of text annotations into DocBin then save """
    # Load a new spacy model:
    nlp = spacy.blank(lang)
    # Create a DocBin object:
    db = DocBin()
    for text, annotations in input: # Data in previous format
        doc = nlp(text)
        ents = []
        spans = []
        for start, end, label in annotations['entities']: # Add character indexes
            # Note: this span runs from token 0 to len(doc), i.e. the whole doc
            spans.append(Span(doc, 0, len(doc), label=label))
            span = doc.char_span(start, end, label=label)
            ents.append(span)
        doc.ents = ents # Label the text with the ents
        group = SpanGroup(doc, name="sc", spans=spans)
        doc.spans["sc"] = group
        db.add(doc)
    db.to_disk(output_path)

convert_to_docbin(examples, output_path="/train.spacy", lang='en')
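For comparison, here is a sketch of a variant of the conversion that builds each spancat span from the annotated character offsets instead of from the whole document. This is one possible approach, not a confirmed fix; `convert_to_spancat_docbin` is a hypothetical name, and the example offsets assume "Apple" spans characters 0 to 5. Note that `doc.char_span` returns None when the offsets do not line up with token boundaries, so those cases are skipped here.

```python
import spacy
from spacy.tokens import DocBin


def convert_to_spancat_docbin(examples, lang="en", spans_key="sc"):
    """Build a DocBin whose docs carry labeled spans under `spans_key`."""
    nlp = spacy.blank(lang)
    db = DocBin()
    for text, annotations in examples:
        doc = nlp(text)
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # skip offsets that don't match token boundaries
                spans.append(span)
        doc.spans[spans_key] = spans
        db.add(doc)
    return db


examples = [("Apple is a large company", {"entities": [(0, 5, "ORG")]})]
db = convert_to_spancat_docbin(examples)

# Read the docs back to check what was stored.
for doc in db.get_docs(spacy.blank("en").vocab):
    print([(s.text, s.label_) for s in doc.spans["sc"]])
```

Calling `db.to_disk(...)` on the returned object would write the file as in the original function; it is left out here so the sketch has no side effects.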
My NER model, trained on the same examples, made a lot of predictions, so I'm wondering why spancat doesn't seem to be working. Is my training data in the wrong format? Is my config off? Something else?