I am trying to parse a data-intensive XML file. I am using lxml to iterate over each tag:
import sys

from lxml import etree

sourceFile = sys.argv[1]
events = ("start", "end")
context = etree.iterparse(sourceFile, events=events)
for eachEvent, eachElement in context:
    <the code goes here>
I am facing an issue with the data below:
<QualityData>
<Measure>Care for Older Adults - Functional Status Assessment</Measure>
<Question>Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.</Question>
<Answer>Date:12/31/2019</Answer>
<SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today</SubAnswer2>
<Measure>Care for Older Adults - Pain Screening</Measure>
<Question>Patients, ages 66 years or older, should have a pain assessment at least annually</Question>
<Answer>Date:12/31/2019</Answer>
<SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today</SubAnswer2>
</QualityData>
The tag SubAnswer2 occurs twice in this data, and the second occurrence yields None. Note that the second occurrences of the other tags are read properly. I also only see the issue with this data: in other examples SubAnswer2 occurs multiple times and every occurrence is parsed successfully. The code that I am using to read the value of SubAnswer2 is:
if eachElement.tag == 'SubAnswer2' and QDstart and eachEvent == 'start':
    QDlist.append(eachElement.text)
I tried parsing with ElementTree as well, but then I get None for some other tags. To debug the problem, I wrote a simple loop that parses and prints the data. It looks like, for the missing data, eachElement.text only gets its value when the event is 'end'. The code that I used to print the data:
for eachEvent, eachElement in context:
    print(eachElement.tag, eachEvent, eachElement.text, sep='::')
The output that I got:
QualityData::start::None
Measure::start::Care for Older Adults - Functional Status Assessment
Measure::end::Care for Older Adults - Functional Status Assessment
Question::start::Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.
Question::end::Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today
SubAnswer2::end::Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today
Measure::start::Care for Older Adults - Pain Screening
Measure::end::Care for Older Adults - Pain Screening
Question::start::Patients, ages 66 years or older, should have a pain assessment at least annually
Question::end::Patients, ages 66 years or older, should have a pain assessment at least annually
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::None
SubAnswer2::end::Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today
QualityData::end::None
Observe the text for SubAnswer2 when the event is 'end'. Is there something I can do to make sure that the text for a tag appears when the event is 'start'?
Thanks in advance.
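One way to approach this, sketched under some assumptions: with iterparse, a 'start' event only means the opening tag has been seen, so the element's text may not have been read yet. The text is only guaranteed to be complete on the 'end' event. The sketch below therefore uses 'start' just to track whether we are inside QualityData and reads SubAnswer2 on 'end'. The function name collect_subanswers and the shortened SAMPLE data are made up for illustration, and the fallback to the standard library's ElementTree is only there so the snippet runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Assumption: fall back to the standard library, whose iterparse
    # behaves the same way for the purposes of this sketch.
    import xml.etree.ElementTree as etree


def collect_subanswers(source):
    """Return the text of every SubAnswer2 found inside QualityData."""
    collected = []
    inside_quality_data = False  # plays the role of QDstart in the question
    for event, element in etree.iterparse(source, events=("start", "end")):
        if element.tag == "QualityData":
            # The tag name is already reliable on "start", so use it for state.
            inside_quality_data = event == "start"
            if event == "end":
                element.clear()  # release the finished subtree on large files
        elif event == "end" and element.tag == "SubAnswer2" and inside_quality_data:
            # Only on "end" is element.text guaranteed to be complete.
            collected.append(element.text)
    return collected


# Hypothetical, shortened stand-in for the data in the question.
SAMPLE = b"""<QualityData>
<Measure>Care for Older Adults - Pain Screening</Measure>
<SubAnswer2>first answer</SubAnswer2>
<SubAnswer2>second answer</SubAnswer2>
</QualityData>"""

print(collect_subanswers(io.BytesIO(SAMPLE)))
```

For the real file, passing the path from sys.argv[1] instead of the BytesIO object should work the same way, since iterparse accepts either a filename or a file-like object.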
It feels too complicated to use etree. Here's another way
Result:
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples