I have a problem parsing XML with Python. My XML files look like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="maria" audio_filename="agora_2007_11_05_a" version="11" version_date="080826" xml:lang="catalan">
<Topics>
<Topic id="to1" desc="music"/>
<Topic id="to2" desc="bgnoise"/>
<Topic id="to4" desc="silence"/>
<Topic id="to5" desc="speech"/>
<Topic id="to6" desc="speech+music"/>
</Topics>
<Speakers>
<Speaker id="spk1" name="Xavi Coral" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk2" name="Ferran Martínez" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk3" name="Jordi Barbeta" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
</Speakers>
<Section type="report" topic="to6" startTime="111.286" endTime="119.308">
<Turn speaker="spk1" startTime="111.286" endTime="119.308" mode="planned" channel="studio">
<Sync time="111.286"/>
ha estat director del diari La Vanguàrdia,
<Sync time="113.56"/>
ha estat director general de Barcelona Televisió i director del Centre Territorial de Televisió Espanyola a Catalunya,
<Sync time="119.308"/>
actualment col·labora en el diari
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>
</Section>
</Trans>
And this is my Python code:
import xml.etree.ElementTree as etree
import os
import sys

xmlD = etree.parse(sys.stdin)
root = xmlD.getroot()
# Third child of the root holds the sections
sections = root.getchildren()[2].getchildren()
for section in sections:
    turns = section.getchildren()
    for turn in turns:
        speaker = turn.get('speaker')
        mode = turn.get('mode')
        childs = turn.getchildren()
        for child in childs:
            time = child.get('time')
            opt = child.get('desc')
            extent = child.get('extent')
            # Map the language code of an opening Event to its tag
            if opt == 'es' and extent == 'begin':
                opt = "ESP:"
            elif opt == "la" and extent == 'begin':
                opt = "LAT:"
            elif opt == "en" and extent == 'begin':
                opt = "ENG:"
            else:
                opt = ""
            if not time:
                time = ""
            # The text after each Sync/Event element is stored in its tail
            print time, opt+child.tail.encode('latin-1')
I need to mark the words pronounced in another language with a tag of the form LANG:
For example:
spanish words ENG:hello, spanish words
But when two consecutive words are pronounced in another language, I don't know how to produce this:
spanish words ENG:hello ENG:man, spanish words
The change of language is marked by the Event XML tag.
Right now the output is:
actualment col·labora en el diari ESP:El Periódico de Catalunya.
and I want:
actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.
Could anyone help me?
Thank you!
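You can do something like this instead of your print statement: split child.tail into words and put opt in front of every word, rather than in front of the whole tail once.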
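
A minimal sketch of one way to tag every word, offered as an illustration and not as the only approach. It is written for Python 3, so print is a function and the element is iterated directly instead of calling getchildren(). The names tag_words, turn_lines, LANG_PREFIX and SAMPLE are made up for this example, and SAMPLE is a shortened copy of the Turn from the question.

```python
import xml.etree.ElementTree as ET

# Language codes and tags taken from the if/elif chain in the question.
LANG_PREFIX = {"es": "ESP:", "la": "LAT:", "en": "ENG:"}

# Shortened copy of the <Turn> from the question, used only for this demo.
SAMPLE = """<Turn speaker="spk1" mode="planned">
<Sync time="119.308"/>
actualment col·labora en el diari
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>"""


def tag_words(prefix, text):
    """Put prefix in front of each whitespace-separated word of text.

    Hypothetical helper. An empty prefix leaves the words untouched,
    and a tail of None (no text after the element) yields "".
    """
    words = (text or "").split()
    return " ".join(prefix + word for word in words)


def turn_lines(turn):
    """Return one 'time text' string per child element of a <Turn>."""
    lines = []
    for child in turn:
        prefix = ""
        # Only an opening language Event switches the tag on.
        if child.tag == "Event" and child.get("extent") == "begin":
            prefix = LANG_PREFIX.get(child.get("desc"), "")
        time = child.get("time", "")
        lines.append((time + " " + tag_words(prefix, child.tail)).strip())
    return lines


for line in turn_lines(ET.fromstring(SAMPLE)):
    print(line)
```

In the original loop, the same effect should come from passing opt and child.tail through a helper like tag_words before printing. Note that split() followed by join() collapses line breaks and repeated spaces inside the tail, and that this sketch assumes the foreign-language text sits directly in the tail of the opening Event, as it does in the sample.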