XML parsing results in a None value for some tags

I am trying to parse a data-intensive XML file. I am using lxml to process each tag:

import sys

from lxml import etree

sourceFile = sys.argv[1]
events = ("start", "end")
context = etree.iterparse(sourceFile, events=events)
for eachEvent, eachElement in context:
    ...  # the code goes here

I am facing an issue with the data below:

<QualityData>
        <Measure>Care for Older Adults - Functional Status Assessment</Measure>
        <Question>Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.</Question>
        <Answer>Date:12/31/2019</Answer>
        <SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today</SubAnswer2>
        <Measure>Care for Older Adults - Pain Screening</Measure>
        <Question>Patients, ages 66 years or older, should have a pain assessment at least annually</Question>
        <Answer>Date:12/31/2019</Answer>
        <SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today</SubAnswer2>
    </QualityData>

The SubAnswer2 tag occurs twice in this data, and the second occurrence is read as None. The second occurrences of the other tags are read correctly, and the problem appears only with this data: in other examples, multiple occurrences of SubAnswer2 are parsed successfully. The code I use to read the value of SubAnswer2 is:

if eachElement.tag=='SubAnswer2' and QDstart and eachEvent=='start':
    QDlist.append(eachElement.text)

I tried parsing with ElementTree as well, but then I get None for some other tags. To debug the problem, I wrote a simple loop that parses the data and prints each event. For the missing data, eachElement.text is populated only when the event is 'end'. The code I used to print the data:

for eachEvent, eachElement in context:
    print(eachElement.tag, eachEvent, eachElement.text, sep='::')

The output that I got:

QualityData::start::None
Measure::start::Care for Older Adults - Functional Status Assessment
Measure::end::Care for Older Adults - Functional Status Assessment
Question::start::Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.
Question::end::Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today
SubAnswer2::end::Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today
Measure::start::Care for Older Adults - Pain Screening
Measure::end::Care for Older Adults - Pain Screening
Question::start::Patients, ages 66 years or older, should have a pain assessment at least annually
Question::end::Patients, ages 66 years or older, should have a pain assessment at least annually
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::None
SubAnswer2::end::Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today
QualityData::end::None

Observe the text for SubAnswer2 when the event is 'end'. Is there something I can do to make sure that the text for a tag appears when the event is 'start'?

Thanks in advance.

There is 1 answer

dabingsou On

Using etree feels too complicated for this. Here's another way:

from simplified_scrapy import SimplifiedDoc
html = '''
<QualityData>
    <Measure>Care for Older Adults - Functional Status Assessment</Measure>
    <Question>Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.</Question>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today</SubAnswer2>
    <Measure>Care for Older Adults - Pain Screening</Measure>
    <Question>Patients, ages 66 years or older, should have a pain assessment at least annually</Question>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today</SubAnswer2>
</QualityData>
'''
doc = SimplifiedDoc(html)
nodes = doc.QualityData.children
for node in nodes:
    print('{:<15}{}'.format(node.tag, node.text))

# Or collect the text of every SubAnswer2 tag at once
print(doc.SubAnswer2s.text)

Result:

Measure        Care for Older Adults - Functional Status Assessment
Question       Patients, ages 66 years or older, should have a functional status assessment completed every calendar year.
Answer         Date:12/31/2019
SubAnswer2     Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today
Measure        Care for Older Adults - Pain Screening
Question       Patients, ages 66 years or older, should have a pain assessment at least annually
Answer         Date:12/31/2019
SubAnswer2     Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today
['Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today', 'Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today']

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
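
If staying with iterparse is preferable, the output in the question points at a likely cause: on a 'start' event the parser only guarantees that the opening tag has been seen, so the text may not have been read yet, while on 'end' the element is complete. Below is a minimal sketch of reading the text on 'end' instead. It uses the standard library's xml.etree.ElementTree so that it runs without extra installs; lxml.etree.iterparse accepts the same events argument. The names SAMPLE, collect_sub_answers and inside_quality_data are illustrative only (the last one stands in for the QDstart flag from the question), and the sample data is a shortened copy of the XML above.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the data from the question (illustrative only).
SAMPLE = b"""<QualityData>
    <Measure>Care for Older Adults - Functional Status Assessment</Measure>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition, event, or body system) today</SubAnswer2>
    <Measure>Care for Older Adults - Pain Screening</Measure>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition, event, or body system) completed today</SubAnswer2>
</QualityData>"""


def collect_sub_answers(source):
    """Return the text of every SubAnswer2 found inside QualityData."""
    values = []
    inside_quality_data = False
    for event, element in ET.iterparse(source, events=("start", "end")):
        if element.tag == "QualityData":
            # The tag name is available on both events, so "start" is
            # safe for tracking whether we are inside the block.
            inside_quality_data = (event == "start")
        elif event == "end" and element.tag == "SubAnswer2" and inside_quality_data:
            # On "end" the element has been parsed completely,
            # so its text can be read here.
            values.append(element.text)
    return values


sub_answers = collect_sub_answers(io.BytesIO(SAMPLE))
for text in sub_answers:
    print(text)
```

The only change from the code in the question is the event tested for SubAnswer2: 'end' rather than 'start'. The 'start' event is still useful for things that depend only on the tag itself, such as setting a flag when QualityData opens.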