Build a GoldDoc from spaCy offset format to train a blank model with the CLI


I'm currently doing NER with three labels:

  • PERSON
  • PHONE
  • ADDRESS

I can train my model from Python code, but I want to use CLI training, which gives more flexibility.

I have converted my data to spaCy's offset training format, which looks like this:

[
    ["Bonjour\r\n\r\n\r\n\r\ncordialement, Thomas\r\n\r\n tel 0102030405",{"entities": [[70,79,"PHONE"],[56,61,"PER"]]}]
]
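Each entity in this format is a [start, end, label] triple of character positions in the raw string, where text[start:end] should be the entity text and every \r and \n counts as one character. A quick sanity check before converting can catch misaligned spans. This is only a minimal sketch: check_spans is a hypothetical helper, not part of spaCy, and the offsets below were computed by hand for the string shown in the snippet.

```python
def check_spans(text, entities):
    """Return (start, end, label, covered_text) per span.

    covered_text is None when the span falls outside the string.
    """
    report = []
    for start, end, label in entities:
        in_range = 0 <= start < end <= len(text)
        report.append((start, end, label, text[start:end] if in_range else None))
    return report


text = "Bonjour\r\n\r\n\r\n\r\ncordialement, Thomas\r\n\r\n tel 0102030405"
# Offsets computed for this exact string (illustrative, not the asker's values)
report = check_spans(text, [[29, 35, "PERSON"], [44, 54, "PHONE"]])
for row in report:
    print(row)
```

Any row whose last field is None, or does not show the text you meant to annotate, points to an offset that needs fixing before conversion.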

To train and evaluate my model with the CLI, I need to convert this data to the gold format.

I'm aware of the method below, but it needs an existing nlp object:

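from spacy.gold import biluo_tags_from_offsets  # spaCy v2.x

doc = nlp(text)  # tokenizes the raw text
tags = biluo_tags_from_offsets(doc, offsets)  # one BILUO tag per token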

My question is: how can I convert spaCy offset data to the gold format if I need to create a new model with specific labels?


There is 1 answer

Answered by aab

You only need the model here for tokenization and sentence segmentation, so it would also work to say:

from spacy.lang.en import English
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
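To illustrate why tokenization is all that is needed for the offsets-to-tags step, here is a rough pure-Python sketch of the idea. It is not spaCy's implementation: the whitespace tokenizer is only a stand-in for spaCy's tokenizer, tokenize and offsets_to_biluo are hypothetical names, the sample string and offsets are made up for illustration, and raising on misaligned spans is this sketch's own choice.

```python
import re


def tokenize(text):
    """Whitespace tokenizer (stand-in for spaCy's): (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def offsets_to_biluo(text, entities):
    """Map [start, end, label] character spans to one BILUO tag per token."""
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Indices of tokens lying entirely inside the span
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        # The span must start and end exactly on token boundaries
        if not inside or tokens[inside[0]][1] != start or tokens[inside[-1]][2] != end:
            raise ValueError(f"span {start}-{end} does not align with token boundaries")
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label  # single-token entity
        else:
            tags[inside[0]] = "B-" + label  # first token
            for i in inside[1:-1]:
                tags[i] = "I-" + label  # inner tokens
            tags[inside[-1]] = "L-" + label  # last token
    return tags


text = "cordialement, Thomas\r\n\r\n tel 0102030405"
entities = [[14, 20, "PERSON"], [29, 39, "PHONE"]]
tags = offsets_to_biluo(text, entities)
print(list(zip([t for t, _, _ in tokenize(text)], tags)))
```

No trained components or predefined label set are involved: the labels come straight from the annotations, which is why a blank pipeline is enough for this step.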