I'm very new to Python so I appreciate my approach is probably a bit rough and ready, but any help would be very welcome.
I'm looking to loop through a file of XML lines and parse the date in one of the tags. I have the elements working individually: I can read in the file, loop through it and write to an output file, and separately I can take one line of the XML and parse it to extract the date. However, when I try to combine the two by reading in the lines one by one and parsing them, I get the following error:
Traceback (most recent call last):
  File "./sadpy10.py", line 19, in <module>
    DOMTree = xml.dom.minidom.parse(line)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 922, in parse
    fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: '<Header><Version>1.0</Version>....<cd:Data>...</Data>..... <cd:DateReceived>20070620171524</cd:DateReceived>'
The initial input file (report2.out) is shown below. The other input file (parseoutput.out) is the same, except that the considerable whitespace at the end of each line has been removed, as I was getting an IO error saying the line was too long:
<Header><Version>1.0</Version>....<cd:Data>...</Data>.....<cd:DateReceived>20070620171524</cd:DateReceived>
<Header><Version>1.0</Version>....<cd:Data>...</Data>.....<cd:DateReceived>20140523012300</cd:DateReceived>
...
My code is here:
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
import datetime

f = open('report2.out','r')
file = open("parseoutput.out", "w")
for line in f:
    # I had to strip the whitespace from end of each line as I was getting error saying the lines were too long
    line = line.rstrip()
    file.write(line + '\n')
f.close()
file.close()

f = open("parseoutput.out","r")
for line in f:
    DOMTree = xml.dom.minidom.parse(line)
    collection = DOMTree.documentElement
    get_date = collection.getElementsByTagName("cd:DateReceived").item(0).firstChild.nodeValue
    get_date = datetime.datetime.strptime(get_date, "%Y%m%d%H%M%S").isoformat()
    get_date = get_date.replace("T"," ")
    print get_date
f.close()
Any help would be greatly appreciated.
xml.dom.minidom.parse accepts either a filename or a file (or file-like object) as its first argument. You are passing it the text of a line, so it treats that text as a filename and tries to open it, which is where the IOError comes from. Because parseoutput.out contains a separate XML document on each line, this function won't work for you. Instead, use xml.dom.minidom.parseString. It's a shortcut for creating a StringIO object and passing it to parse.
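As a rough sketch of what that could look like, here is the date extraction from your loop wrapped in a function that calls parseString on a single line. The function name extract_date and the sample line are made up for illustration. In particular, the sample assumes each line is a complete, well-formed document and that the cd prefix is declared somewhere in it (the namespace URI shown is a placeholder), since your real lines are abbreviated in the question.

```python
import datetime
import xml.dom.minidom


def extract_date(line):
    """Parse one line holding a complete XML document and return the
    cd:DateReceived value formatted as 'YYYY-MM-DD HH:MM:SS'."""
    # parseString takes the XML text itself, not a filename.
    dom = xml.dom.minidom.parseString(line)
    node = dom.getElementsByTagName("cd:DateReceived").item(0)
    raw = node.firstChild.nodeValue
    parsed = datetime.datetime.strptime(raw, "%Y%m%d%H%M%S")
    return parsed.isoformat().replace("T", " ")


# Hypothetical sample line: the real lines are elided in the question,
# so the structure and the namespace URI here are assumptions.
sample = (
    '<Header xmlns:cd="urn:example:cd"><Version>1.0</Version>'
    "<cd:DateReceived>20070620171524</cd:DateReceived></Header>"
)

print(extract_date(sample))
```

In your script, you would call something like this on each line inside the for loop in place of the xml.dom.minidom.parse(line) call.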