ElementTree errors, html files will not parse using Python/Sublime

1.4k views Asked by At

I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks, but the first one is this: I can not get it to properly parse the file. Below is a brief explanation, the python code and the traceback info.

Using Python & Sublime to parse html files, I am getting several errors. What IS working: it runs fine up until if '.html' in file:. It does not execute that loop. It will iterate through print allFiles just fine. It also creates the csv file and creates the headers (though not in separate columns, but I can ask about that later).

It seems that the problem is in the if tree = ET.parse(HTML_PATH+"/"+file) piece. I've written this several different ways (without "/" and/or "file", for example)--so far I have yet to resolve this problem.

If I can provide more information or if anyone can direct me to other documenation, it would be greatly appreciated. So far I have yet to find anything that addresses this issue.

Many thanks for your thoughts.

//C

# Parses out data from crawled html files under "html files"
# and places the output in output.csv.

import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv

 # TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])

# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"

allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"

for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = ET.parse(HTML_PATH+"/"+file)
        print '===================='
        print 'Parsing file: '+file
        print '===================='
        for node in tree.iter():
            print "tbody"
            # The tbody attribute spells it all (or does it):
            name = node.attrib.get('/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font')

            # Check common header stuff
            if name=='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
                #print '    ------------------'
                #print '  Category:'
                category=node.text
                print "category"

f.close()

Traceback:

File "/Users/C/data/Folder_NS/data_parse.py", line 34, in tree = ET.parse(HTML_PATH+"/"+file) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse tree.parse(source, parser) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse parser.feed(data) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed self._raiseerror(v) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror raise err xml.etree.ElementTree.ParseError: mismatched tag: line 63, column 2

1

There are 1 answers

2
William Jackson On BEST ANSWER

You are trying to parse HTML with an XML parser, and valid HTML is not always valid XML. You would be better off using the HTML parsing library in the lxml package.

import xml.etree.ElementTree as ET
# ...
tree = ET.parse(HTML_PATH + '/' + file)

would be changed to

import lxml.html
# ...
tree = lxml.html.parse(HTML_PATH + '/' + file)