Here is the start of my config file:
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
vectors = "en_core_web_lg"
But I am training my model on my own labels and spans. What is vectors = "en_core_web_lg" used for?
After all, I am using the following logic to train my model:
import spacy
from spacy.tokens import DocBin, Span, SpanGroup

# Load a new spacy model:
nlp = spacy.blank("en")
# Create a DocBin object:
db = DocBin()
for text, annotations in input:  # Data in previous format
    doc = nlp(text)
    ents = []
    spans = []
    for start, end, label in annotations:  # Add character indexes
        spans.append(Span(doc, 0, len(doc), label=label))
        span = doc.char_span(start, end, label=label)
        # char_span returns None when the offsets do not align with token boundaries
        if span is not None:
            ents.append(span)
    doc.ents = ents  # Label the text with the ents
    group = SpanGroup(doc, name="sc", spans=spans)
    doc.spans["sc"] = group
    db.add(doc)
db.to_disk(output_path)
Please explain where these vectors are used in such a configuration.
Assume I have a list of annotated data in the format [(text_1, [(start_1, end_1, LABEL_1)]), (text_2, [(start_2, end_2, LABEL_1)]), ...].
The static word vectors are included as a tok2vec feature if you have include_static_vectors = true in the tok2vec config.
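To check whether a given config actually turns this on, something like the following rough sketch may help. It uses Python's standard configparser rather than spaCy's own config loader, and the config excerpt and the helper name static_vector_settings are hypothetical, made up for illustration. The section and key names assume the default tok2vec embedding layer (spacy.MultiHashEmbed.v2) that a filled config typically contains.

```python
import configparser

# Hypothetical excerpt of a filled config.cfg (not taken from the question).
# Section and key names assume spaCy's default MultiHashEmbed embedding layer.
CONFIG_TEXT = """
[paths]
vectors = "en_core_web_lg"

[initialize]
vectors = ${paths.vectors}

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
include_static_vectors = true
"""


def static_vector_settings(config_text):
    """Return {section: bool} for every section that sets include_static_vectors."""
    # interpolation=None keeps values such as ${paths.vectors} as plain strings
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return {
        section: parser.getboolean(section, "include_static_vectors")
        for section in parser.sections()
        if parser.has_option(section, "include_static_vectors")
    }


settings = static_vector_settings(CONFIG_TEXT)
print(settings)  # {'components.tok2vec.model.embed': True}
```

If the setting is false (or absent) in every section, the vectors named in [paths] are most likely not being used as an embedding feature by that component.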