I have an XML file of the form:
<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>
I need to process it so that, for instance, when the user inputs nd, the program matches it against the <Phonetic> tag and returns and from the <Phonemic> part. I thought that if I could convert the XML file to a dictionary, I would be able to iterate over the data and find information when needed.
I searched and found xmltodict which is used for the same purpose:
import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())
Running this gives me an OrderedDict:
>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])
Unfortunately, this hasn't made things simpler, and I am not sure how to implement the program with the new data structure. For example, to access nd I'd have to write:
obj['NewDataSet']['Root'][0]['Phonetic']
which is ridiculously complicated. I tried to turn it into a regular dictionary with dict(), but because it is nested, the inner layers remain OrderedDicts, and my data is very large.
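On the nested-OrderedDict point: OrderedDict is a subclass of dict, so key lookups work on it unchanged and a conversion is mostly cosmetic. If plain dicts are still wanted, a minimal recursive sketch could look like this. The helper name to_plain is hypothetical (not part of xmltodict), and the sample data is a shortened stand-in for the parsed output above.

```python
from collections import OrderedDict


def to_plain(value):
    """Recursively convert nested mappings and lists into plain dicts and lists."""
    if isinstance(value, dict):  # OrderedDict is a dict subclass, so this matches it too
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_plain(item) for item in value]
    return value  # strings and None are returned unchanged


# Shortened stand-in for the structure returned by xmltodict.parse
nested = OrderedDict([("NewDataSet", OrderedDict([("Root", [
    OrderedDict([("Phonemic", "and"), ("Phonetic", "nd")]),
])]))])

plain = to_plain(nested)
print(type(plain["NewDataSet"]["Root"][0]))  # <class 'dict'>
```

The conversion walks every record once, so for a very large file it adds a full extra pass over the data.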
If you are accessing this as obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right. Instead, you can do the following:
Even though this code looks longer, the advantage is that it becomes a lot more compact and modular once you start dealing with sufficiently large XML files.
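A minimal sketch of what such a modular approach could look like, assuming obj is the mapping returned by xmltodict.parse in the question. The helper names build_phonetic_index and phonemic_for are hypothetical, not part of xmltodict, and the obj literal below is a hand-written stand-in for the parsed output so the example runs on its own.

```python
def build_phonetic_index(parsed):
    """Map each <Phonetic> value to its full <Root> record."""
    roots = parsed["NewDataSet"]["Root"]
    # With a single <Root> element, xmltodict returns one mapping instead of a list
    if not isinstance(roots, list):
        roots = [roots]
    index = {}
    for record in roots:
        # setdefault keeps the first record when a phonetic form repeats
        index.setdefault(record["Phonetic"], record)
    return index


def phonemic_for(index, phonetic):
    """Return the <Phonemic> text for a phonetic form, or None if it is unknown."""
    record = index.get(phonetic)
    return record["Phonemic"] if record is not None else None


# Stand-in for the result of xmltodict.parse(...) shown in the question
obj = {"NewDataSet": {"Root": [
    {"Phonemic": "and", "Phonetic": "nd", "Description": None,
     "Start": "0", "End": "8262"},
    {"Phonemic": "comfortable", "Phonetic": "comfetebl", "Description": "adj",
     "Start": "61404", "End": "72624"},
]}}

index = build_phonetic_index(obj)
print(phonemic_for(index, "nd"))  # and
```

Building the index once turns every later lookup into a single dictionary access, which matters more as the file grows. Whether repeated phonetic forms should keep the first record, the last, or all of them is a design choice this sketch does not settle.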
PS: I had the same issues with xmltodict. But compared with using xml.etree.ElementTree to parse the XML files, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with other inanities of the xml module.
EDIT
The following code works for me: