I have XML data that looks like this:
<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>
I want to use texts like this as training data in spaCy, so I need them in the form spaCy requires:
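import spacy
from spacy.training import Example
nlp = spacy.blank("en")  # a blank pipeline is enough to build the Doc
doc = nlp("Laura flew to Silicon Valley.")
gold_dict = {"entities": [(0, 5, "PERSON"), (14, 28, "LOC")]}
example = Example.from_dict(doc, gold_dict)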
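As a quick sanity check on what those numbers mean: they appear to be character offsets into the raw text, with an exclusive end, so plain string slicing should give back the entity text. A minimal sketch that only uses the values from the snippet above:

```python
text = "Laura flew to Silicon Valley."
entities = [(0, 5, "PERSON"), (14, 28, "LOC")]

# Each tuple is (start, end, label); slicing the text with start:end
# should return exactly the annotated entity.
for start, end, label in entities:
    print(label, repr(text[start:end]))
```

So whatever I extract from the XML has to end up as one plain string plus offsets that slice cleanly out of that string.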
I still can't get the offsets right, i.e. the character positions where an entity starts and where it ends. Is there a particularly suitable procedure for this?
Edit: here is what I have tried so far with ElementTree:
from xml.etree import ElementTree as ET
data = '''
<root>
<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>
</root>
'''
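def get_entity_type(ana):
    # Map the prefix of the "ana" attribute to an entity label.
    if 'regO' in ana:
        return 'PLACE'
    if 'regP' in ana:
        return 'PERSON'
    if 'regW' in ana:
        return 'WORK'
    # NOTE: unreachable as written, because 'regP' is already handled above.
    if 'regP' in ana:
        return "PERIODICA"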
root = ET.fromstring(data)
print(root)
#text = ""
entities = []
current_pos = 0  # never advanced below, so every offset stays 0
for node in root.iter():
    #print(node)
    if node.tag == "anchor" and node.get('type') == 'b':
        start_pos = current_pos
        ana = node.get('ana')
        entity_type = get_entity_type(ana)
        #print(entity_type)
    elif node.tag == "anchor" and node.get('type') == 'e':
        entities.append((entity_type, start_pos, current_pos))
#print(entities)
So detecting the entity types works, but my approach to capturing the start and end positions of the entities is wrong. I also tried pawpaw, as described here, but it always fails to find "Ito".
That's what I tried with pawpaw:
from pawpaw import ito
root = ET.fromstring(data)
elements = root.findall('.//')
print(elements)
for e in elements:
    plain_text = e.Ito.find('*[d:text]')
    # print(plain_text)
To get at the text, you need each element's .tail, i.e. the text that follows the element's closing tag.
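One way this could look, as a sketch rather than a definitive solution: walk the tree recursively, append every .text and .tail to a running string, and record the current length whenever an anchor is met. Several things here are assumptions and not part of the question: the helper name extract, the LABELS mapping with its "MISC" fallback, treating <lb/> as a newline, and pairing begin and end anchors through the numeric suffix of xml:id (TidB13 with TidE13). The pairing via xml:id is used because the sample has ana="regO.lemID_12" on the begin anchor but ana="reg0.lemID_12" on the end anchor, so matching on ana would miss that pair.

```python
from xml.etree import ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Prefix-to-label mapping taken from the question (assumed to be complete enough).
LABELS = {"regO": "PLACE", "regP": "PERSON", "regW": "WORK"}

data = '''
<root>
<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>
</root>
'''


def extract(item):
    """Return (plain_text, [(start, end, label), ...]) for one <item>."""
    parts = []        # text chunks collected so far
    pos = 0           # current character offset into the plain text
    open_spans = {}   # pairing key -> (start offset, label)
    entities = []

    def emit(chunk):
        nonlocal pos
        if chunk:
            parts.append(chunk)
            pos += len(chunk)

    def walk(node):
        if node.tag == "anchor":
            # "TidB13" / "TidE13" -> "13" (assumed naming scheme)
            key = node.get(XML_ID, "")[4:]
            if node.get("type") == "b":
                prefix = node.get("ana", "").split(".")[0]
                open_spans[key] = (pos, LABELS.get(prefix, "MISC"))
            elif key in open_spans:
                start, label = open_spans.pop(key)
                entities.append((start, pos, label))
        elif node.tag == "lb":
            emit("\n")  # assumption: a line break becomes a newline
        emit(node.text)           # text before the first child
        for child in node:
            walk(child)
            emit(child.tail)      # text after the child's end tag

    walk(item)
    return "".join(parts), entities


root = ET.fromstring(data)
for item in root.iter("item"):
    text, entities = extract(item)
    for start, end, label in entities:
        print(start, end, label, repr(text[start:end]))
```

The tuples come out as (start, end, label), which is the order the "entities" list in the gold dict uses. Note that the sample contains a nested annotation (the PERSON span sits inside the WORK span); as far as I know spaCy's entity recognizer expects non-overlapping spans, so you would probably have to drop or flatten one of the two before calling Example.from_dict.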