I have a CSV dataset with four columns: "sentence", "term1", "term2", and "relation". Each "sentence" contains a relation between term1 and term2. I process this dataset with stanza.Pipeline()
from the Stanza library and would like to store the result in CoNLL-U format. Later on, this dataset will be used to train a model that extracts triples of the form <term1><relation type><term2> from a sentence.
What is the best practice for storing the term1, term2 and relation information in the CoNLL-U format?
For example, given this row of data, where should the annotation for term1, term2 and relation be included in the CoNLL-U format?
A row from the CSV file:
"sentence", "term1", "term2", "relation"
"Ibuprofen helps with headaches.", "Ibuprofen", "headaches", "treat"
Is it fine to add this information to the miscellaneous (MISC) field, the last column, as shown below (tag=term1|relation=treat)?
# text = Ibuprofen helps with headaches.
# sent_id = 0
# constituency = (ROOT (S (NP (NNP Ibuprofen)) (VP (VBZ helps) (PP (IN with) (NP (NNS headaches)))) (. .)))
# sentiment = 0
1 Ibuprofen Ibuprofen PROPN NNP Number=Sing 2 nsubj _ tag=term1|relation=treat|start_char=0|end_char=9|ner=O
2 helps help VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ start_char=10|end_char=15|ner=O
3 with with ADP IN _ 4 case _ start_char=16|end_char=20|ner=O
4 headaches headache NOUN NNS Number=Plur 2 obl _ tag=term2|relation=treat|start_char=21|end_char=30|ner=O
5 . . PUNCT . _ 2 punct _ start_char=30|end_char=31|ner=O
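
To make the idea concrete, here is a minimal sketch of how the extra keys could be merged into the MISC column of a token line. It is plain Python and does not call Stanza; the helper name add_misc and the key names tag and relation are my own choices, not part of the CoNLL-U standard. It assumes the token line is tab-separated with the usual 10 columns.

```python
def add_misc(conllu_line, extra):
    """Merge extra key=value pairs into the MISC column (column 10) of a token line.

    add_misc is a hypothetical helper for illustration, not a Stanza function.
    """
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")

    # MISC is a '|'-separated list of attributes, or '_' when empty.
    items = [] if cols[9] == "_" else cols[9].split("|")
    misc = {}
    for item in items:
        key, sep, value = item.partition("=")
        # Keep attributes without '=' as bare keys.
        misc[key] = value if sep else None

    # New keys are appended; existing keys with the same name are overwritten.
    misc.update(extra)

    cols[9] = "|".join(
        key if value is None else f"{key}={value}" for key, value in misc.items()
    ) or "_"
    return "\t".join(cols)


line = (
    "1\tIbuprofen\tIbuprofen\tPROPN\tNNP\tNumber=Sing\t2\tnsubj\t_\t"
    "start_char=0|end_char=9|ner=O"
)
annotated = add_misc(line, {"tag": "term1", "relation": "treat"})
print(annotated)
```

In this sketch the new keys end up after the existing ones (ner=O|tag=term1|relation=treat) rather than before them as in my example above; as far as I understand, the order of attributes in MISC carries no meaning.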