Convert Prodigy JSONL / spaCy Doc format to CoNLL

I have been searching for a while but haven't found a solution to my problem. For a relation classification task, I have annotated several news-like text documents with the Prodigy annotation software. Prodigy exports the annotations as a JSONL file, which can be converted into a .spacy file. In the JSONL format, each line represents one news article with its annotations.

Now I want to convert my annotations into a more standardized format like CoNLL, so that I can work with them in other open-source software such as INCEpTION (unfortunately, Prodigy has not been a good choice). So far I haven't found any library, script, or tool that can convert Prodigy JSONL or .spacy files to CoNLL.

Here is an example of what the Prodigy JSONL format looks like:

{
  "text": "My mother’s name is Sasha Smith. She likes dogs and pedigree cats.",
  "tokens": [
    {"text": "My", "start": 0, "end": 2, "id": 0, "ws": true},
    {"text": "mother", "start": 3, "end": 9, "id": 1, "ws": false},
    {"text": "’s", "start": 9, "end": 11, "id": 2, "ws": true},
    {"text": "name", "start": 12, "end": 16, "id": 3, "ws": true },
    {"text": "is", "start": 17, "end": 19, "id": 4, "ws": true },
    {"text": "Sasha", "start": 20, "end": 25, "id": 5, "ws": true},
    {"text": "Smith", "start": 26, "end": 31, "id": 6, "ws": true},
    {"text": ".", "start": 31, "end": 32, "id": 7, "ws": true, "disabled": true},
    {"text": "She", "start": 33, "end": 36, "id": 8, "ws": true},
    {"text": "likes", "start": 37, "end": 42, "id": 9, "ws": true},
    {"text": "dogs", "start": 43, "end": 47, "id": 10, "ws": true},
    {"text": "and", "start": 48, "end": 51, "id": 11, "ws": true, "disabled": true},
    {"text": "pedigree", "start": 52, "end": 60, "id": 12, "ws": true},
    {"text": "cats", "start": 61, "end": 65, "id": 13, "ws": true},
    {"text": ".", "start": 65, "end": 66, "id": 14, "ws": false, "disabled": true}
  ],
  "spans": [
    {"start": 20, "end": 31, "token_start": 5, "token_end": 6, "label": "PERSON"},
    {"start": 43, "end": 47, "token_start": 10, "token_end": 10, "label": "NP"},
    {"start": 52, "end": 65, "token_start": 12, "token_end": 13, "label": "NP"}
  ],
  "relations": [
    {
      "head": 0,
      "child": 1,
      "label": "POSS",
      "head_span": {"start": 0, "end": 2, "token_start": 0, "token_end": 0, "label": null},
      "child_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null}
    },
    {
      "head": 1,
      "child": 8,
      "label": "COREF",
      "head_span": {"start": 3, "end": 9, "token_start": 1, "token_end": 1, "label": null},
      "child_span": {"start": 33, "end": 36, "token_start": 8, "token_end": 8, "label": null}
    },
    {
      "head": 9,
      "child": 13,
      "label": "OBJECT",
      "head_span": {"start": 37, "end": 42, "token_start": 9, "token_end": 9, "label": null},
      "child_span": {"start": 52, "end": 65, "token_start": 12, "token_end": 13, "label": "NP"}
    }
  ]
}

Thanks in advance

I want to convert either the Prodigy JSONL file or the .spacy annotation file into CoNLL.

There is 1 answer

polm23 On

You can load your spaCy Docs from the .spacy file and use spacy-conll to write them out as CoNLL files.
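If the goal is mainly to carry the span labels over, the JSONL can also be converted directly without going through spaCy. Below is a minimal sketch, not a tested converter: the function names `record_to_conll` and `jsonl_to_conll` are made up for illustration, and the output layout (one token per line, token and BIO tag separated by a space, blank line between documents) is an assumption modeled on CoNLL-2002/2003 style files. It treats `token_end` as inclusive, as in the example record from the question, and it ignores the `relations` list, since this simple two-column layout has no place for relations.

```python
import json


def record_to_conll(record, sep=" "):
    """Turn one Prodigy record into 'token<sep>tag' lines with BIO tags.

    Assumes each span carries token_start/token_end indices, with
    token_end inclusive (as in the example record in the question).
    """
    tokens = record["tokens"]
    tags = ["O"] * len(tokens)
    for span in record.get("spans", []):
        start, end = span["token_start"], span["token_end"]
        tags[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + span["label"]
    return "\n".join(tok["text"] + sep + tag for tok, tag in zip(tokens, tags))


def jsonl_to_conll(jsonl_path, conll_path, sep=" "):
    """Convert a JSONL file (one record per line) into one CoNLL-style file.

    Documents are separated by a blank line; relations are not exported.
    """
    with open(jsonl_path, encoding="utf-8") as src, \
            open(conll_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip empty lines
                dst.write(record_to_conll(json.loads(line), sep) + "\n\n")
```

Calling `jsonl_to_conll("annotations.jsonl", "annotations.conll")` (both file names are placeholders) would write one tagged block per article. Whether a given tool accepts this exact column layout is something to check against its import documentation, and the relation annotations would still need a format that supports them.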