I am trying to convert the data in a text file into an XML file by wrapping the parts of the text I want to mark in XML tags.

The problem: xml.etree.ElementTree does not recognise the string.

The code so far:

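import xml.etree.ElementTree as ET
with open('input/application_EN.txt', 'r') as f:
    # read the whole file into the string that is split into paragraphs below
    application_text = f.read()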

The first thing I want to do is to tag the paragraphs. The text should look like:

    <paragraph id=1>
    <paragraph id=2>

So far I have coded:

# splitting the text into paragraphs
list_of_paragraphs = application_text.splitlines()
# creating a new list where no_null paragraphs will be added

# counter of paragraphs of the XML file
j = 0

# Create the XML file with the paragraphs
for paragraph in list_of_paragraphs:
    # Adding only the paragraphs different than ''
    if paragraph != '':
        j = j + 1
        # be careful with the space after and before the tag.
        # Adding the XML tags per paragraph
        xml_element = '<paragraph id="' + str(j) + '">' + paragraph.strip() + ' </paragraph>'

# Now I pass the whole string to the XML constructor
root = ET.fromstring(description_text)

I get this error:

not well-formed (invalid token): line 1, column 6

After some investigation I realised that the error is caused by the text containing the symbol "&". Adding and removing "&" in several places confirms this.
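A minimal sketch that reproduces this without the input file. The two sample strings and the helper name `parses` are made up for illustration and are not taken from the real data:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-paragraph inputs, not taken from the real file.
with_ampersand = '<paragraph id="1">duPont & Co.</paragraph>'
without_ampersand = '<paragraph id="1">duPont and Co.</paragraph>'


def parses(xml_string):
    """Return True if ElementTree accepts the string, False on a ParseError."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


print(parses(with_ampersand))     # the raw "&" is rejected by the parser
print(parses(without_ampersand))  # the same markup without "&" is accepted
```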

The question is: why is "&" not treated as text, and what can I do about it?

I know I could replace every "&", but then I would lose information, since "& Co." is quite an important string. I would like the text to stay intact (no changes to the content).



EDIT: To make it easier, here is the beginning of the text I am working on (instead of opening a file, you can paste this in to check it):

Has all kind of kind of references. also measures.

Photovoltaic solar cells for directly converting radiant energy from the sun into electrical energy are well known. The manufacture of photovoltaic solar cells involves provision of semiconductor substrates in the form of sheets or wafers having a shallow p-n junction adjacent one surface thereof (commonly called the "front surface"). Such substrates may include an insulating anti-reflection ("AR") coating on their front surfaces, and are sometimes referred to as "solar cell wafers". The anti-reflection coating is transparent to solar radiation. In the case of silicon solar cells, the AR coating is often made of silicon nitride or an oxide of silicon or titanium. Such solar cells are manufactured and sold by E.I. duPont de Nemeurs & Co.'

As you can see, the text ends with "& Co.", which is what causes the trouble.

1 Answer

Berlines

From "& Symbol causing error in XML Code":

Some characters have special meaning in XML, and the ampersand (&) is one of them. Consequently, these characters should be substituted (i.e. using string replacement) with their respective entity references. Per the XML specification, there are 5 predefined entities in XML:

&lt;    <   less than
&gt;    >   greater than
&amp;   &   ampersand 
&apos;  '   apostrophe
&quot;  "   quotation mark
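As a rough sketch of that substitution in Python, the standard library's `xml.sax.saxutils.escape` replaces `&`, `<` and `>` with their entity references. The helper name `to_paragraph_xml` and the shortened sample sentence are assumptions made for this example only. Note that the parser turns `&amp;` back into `&`, so the parsed text is unchanged and no information is lost:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_paragraph_xml(paragraph, number):
    """Wrap one paragraph in a <paragraph> tag, escaping &, < and > in its text."""
    return '<paragraph id="{}">{}</paragraph>'.format(number, escape(paragraph.strip()))


# Shortened sample taken from the end of the text in the question.
sample = "Such solar cells are manufactured and sold by E.I. duPont de Nemeurs & Co."

xml_string = to_paragraph_xml(sample, 1)
element = ET.fromstring(xml_string)  # parses, because "&" is now written as "&amp;"

print(xml_string)    # the serialized form contains "&amp; Co."
print(element.text)  # the parsed text contains the original "& Co." again
```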

Thanks to @fallenreaper for pointing me towards BS (BeautifulSoup) to create XML files.
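For readers who prefer to stay in the standard library, the same idea of letting a library build the XML can be sketched with `xml.etree.ElementTree` itself instead of BS. This is not the code used above, only an illustration: the function name `build_document`, the root tag `document` and the sample text are assumptions. Text assigned to `.text` is stored unchanged and escaped only when the tree is serialized:

```python
import xml.etree.ElementTree as ET


def build_document(text):
    """Build <document><paragraph id="n">...</paragraph>...</document> from plain text."""
    root = ET.Element("document")  # root tag name is an assumption
    number = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty paragraphs
        number += 1
        paragraph = ET.SubElement(root, "paragraph", id=str(number))
        paragraph.text = line  # stored as-is; escaping happens on serialization
    return root


# Made-up two-paragraph input with an empty line in between.
sample_text = "First paragraph.\n\nSold by E.I. duPont de Nemeurs & Co.\n"

root = build_document(sample_text)
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Because the tree is never built by string concatenation, there is no step where a raw "&" has to be parsed as markup.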