How to parse an XML-like text file in Python?

I have a text file in an XML-like language which looks like this:

<StoryText>
                <DefaultStyle/>
                <para ALIGN="3" LINESP="10"/>
                <tab FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
                <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
                <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
                
</StoryText>

My goal is to parse this file in Python so that I can replace the content of the CH= attributes with other text.

Example:

> <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit"
> FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5"
> TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1"
> TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"
> CH="**TEXT**"/>

transformed into

> <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit"
> FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5"
> TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1"
> TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"
> CH="**REPLACEMENT TEXT**"/>

I tried to use the xml.etree.ElementTree library with the parse and getroot methods as usual, but I got this error message:

xml.etree.ElementTree.ParseError: no element found

This message apparently occurs because the file is not real XML, only a look-alike.

Do you have an idea of how I could achieve this? NB: I'm not allowed to reformat the input file by changing its structure, because it is a Scribus .sla file.

My code:

import xml.etree.ElementTree as ET
cheminout = "file.sla"
tree = ET.parse(cheminout)  # error occurs here
root = tree.getroot()

The .sla file is several thousand lines long and begins with:

`<?xml version="1.0" encoding="UTF-8"?>`

There are 2 answers

3
JB B On

I found the error: I was parsing the path instead of the actual file.

Thanks for your time, guys.
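
For anyone hitting the same message: `ET.parse()` expects a file name or file object, while `ET.fromstring()` expects the XML text itself. Mixing the two up, or pointing at an empty file, fails before any element is read. The answer above does not say exactly what went wrong, so the following is only a minimal sketch of how such errors can arise; `try_parse` is a hypothetical helper written for illustration.

```python
import xml.etree.ElementTree as ET


def try_parse(text):
    """Return the root tag, or the ParseError message if parsing fails."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError as exc:
        return f"ParseError: {exc}"


# Empty input (for example, the contents of an empty file) has no root element
empty_result = try_parse("")

# A file name passed where XML text is expected is not valid XML either
path_result = try_parse("file.sla")

# Actual XML text parses fine
ok_result = try_parse('<StoryText><ITEXT CH="TEXT"/></StoryText>')

print(empty_result)
print(path_result)
print(ok_result)
```

Checking that the path points at the intended, non-empty file before calling `ET.parse()` is a cheap way to rule this out.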

1
Andrej Kesely On

Try:

import xml.etree.ElementTree as ET

xml_data = """\
<StoryText>
    <DefaultStyle/>
    <para ALIGN="3" LINESP="10"/>
    <tab FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"/>
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
    <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
</StoryText>"""


root = ET.fromstring(xml_data)

for elem in root.iter("ITEXT"):
    if "TEXT" == elem.get("CH"):
        elem.attrib["CH"] = "REPLACEMENT TEXT"

print(ET.tostring(root, encoding="utf-8").decode("utf-8"))

Prints:

<StoryText>
    <DefaultStyle />
    <para ALIGN="3" LINESP="10" />
    <tab FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" />
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="REPLACEMENT TEXT" />
    <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="REPLACEMENT TEXT" />
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="REPLACEMENT TEXT" />
</StoryText>
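
The same loop could be applied to the .sla file on disk rather than to an inline string. This is a minimal sketch, assuming the file is well-formed XML (as its `<?xml ...?>` header suggests). The function name `replace_ch` and the output file name are hypothetical, and the demo writes a small stand-in file to a temporary directory instead of using a real Scribus document.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def replace_ch(src_path, dst_path, old, new):
    """Replace CH="old" with CH="new" on every ITEXT element; return the count."""
    tree = ET.parse(src_path)  # parse() takes a file name or file object
    count = 0
    for elem in tree.getroot().iter("ITEXT"):
        if elem.get("CH") == old:
            elem.set("CH", new)
            count += 1
    # Ask for the XML declaration so the output still starts with a header line
    tree.write(dst_path, encoding="UTF-8", xml_declaration=True)
    return count


# Stand-in for a real .sla file (assumption: real files have the same shape)
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<StoryText>\n'
    '    <ITEXT FONT="Times New Roman Bold" CH="TEXT"/>\n'
    '    <ITEXT FONT="Times New Roman Regular" CH="OTHER"/>\n'
    '</StoryText>\n'
)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "file.sla")
    dst = os.path.join(tmp, "file_out.sla")
    with open(src, "w", encoding="utf-8") as f:
        f.write(sample)
    changed = replace_ch(src, dst, "TEXT", "REPLACEMENT TEXT")
    with open(dst, encoding="utf-8") as f:
        result = f.read()

print(changed)
print(result)
```

ElementTree re-serializes the whole document, so whitespace and quoting in the output may differ from the original file even though the XML content is equivalent. Writing to a separate output file, as here, keeps the original available for comparison.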