Convert from Prodigy's JSONL format for labeled NER to spaCy's training format?

I am new to Prodigy and spaCy, as well as to working on the command line. I'd like to use Prodigy to label my data for an NER model, and then use spaCy in Python to train the model.

Prodigy stores its annotations in an SQLite database and exports them as JSONL. spaCy expects a different format, which I'm not sure what to call:

LABEL = "ANIMAL"  # placeholder entity label; substitute your own
TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]

How can I convert from one to the other? It seems like this should be easy, but I cannot find it anywhere.

I have no problem loading in the dataset, just converting.

1 Answer

aab (best answer)

As of version 1.9, Prodigy should be able to export this training format with the data-to-spacy recipe: https://prodi.gy/docs/recipes#data-to-spacy
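
If you would rather do the conversion by hand, here is a minimal sketch. It assumes each JSONL line is a Prodigy NER task with a "text" field, a "spans" list whose items carry "start", "end" and "label", and an "answer" field. The function name prodigy_to_spacy and the sample lines are made up for illustration, and this is not how data-to-spacy itself is implemented.

```python
import json


def prodigy_to_spacy(lines):
    """Turn Prodigy-style JSONL lines into (text, {"entities": [...]}) tuples.

    Hypothetical helper: assumes each line has "text", optional "spans"
    (with "start", "end", "label") and optional "answer".
    """
    train_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        # Keep only accepted examples; a missing "answer" is treated
        # as accepted here, which is an assumption of this sketch.
        if task.get("answer", "accept") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in task.get("spans", [])
        ]
        train_data.append((task["text"], {"entities": entities}))
    return train_data


# Made-up sample lines standing in for an exported JSONL file.
sample = [
    '{"text": "horses?", "spans": [{"start": 0, "end": 6, "label": "ANIMAL"}], "answer": "accept"}',
    '{"text": "Do they bite?", "spans": [], "answer": "accept"}',
    '{"text": "skip me", "spans": [], "answer": "reject"}',
]
TRAIN_DATA = prodigy_to_spacy(sample)
```

For a real dataset you could export it first with `prodigy db-out your_dataset > annotations.jsonl` (the dataset and file names are placeholders) and pass the open file object to the function, since it only needs an iterable of lines.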