Using NLTK to perform document classification on website content: issues with BeautifulSoup and NaiveBayes

I have a Python 2.7 project where I want to classify websites based on their content. I have a database in which I store numerous website URLs and their associated categories. There are many categories (= labels), and I wish to classify new sites into the corresponding category based on their content. I've been following the NLTK classification tutorial/example listed here, but have run into some problems I cannot explain.

Here is an outline of the process that I use:

  1. Use MySQLdb to retrieve the category associated with a given website URL. This will be used when extracting data (content) from the URL to pair it with the category (= label) of the site.
  2. Use a getSiteContent(site) function to extract the content from a website

The above function looks like this:

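import re
import urllib2
from collections import defaultdict

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

errors = 0  # number of sites that could not be fetched

def getSiteContent(site):
    try:
        response = urllib2.urlopen(site, timeout = 1)
        htmlSource = response.read()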
    except Exception as e: # <=== some websites may be inaccessible as list isn't up-to-date
        global errors
        errors += 1
        return ''

    soup = BeautifulSoup(htmlSource)
    for script in soup.find_all('script'):
        script.extract()  # drop <script> elements so their code is not treated as content

    commonWords = set(stopwords.words('english'))
    commonWords.update(['function', 'document', 'window', 'functions',
                        'getElementsByTagName', 'parentNode', 'getDocumentById',
                        'javascript', 'createElement', 'Copyright', 'Copyrights',
                        'respective', 'owners', 'Contact Us', 'Mobile Version', 'FAQ',
                        'Privacy Policy', 'Terms of Service', 'Legal Disclaimer'])

    text = soup.get_text()

    # Remove ',', '/', '%', ':'
    re.sub(r'(\\d+[,/%:]?\\d*)', '', text)
    # Remove digits
    re.sub(r'\d+', '', text)
    # Remove non-ASCII
    re.sub(r'[^\x00-\x7F]',' ', text)
    # Remove stopwords
    for word in commonWords :
        text = text.replace(' '+word+' ', ' ')

    # Tokenize the site content using NLTK
    tokens = word_tokenize(text)

    # We collect some word statistics, i.e. how many times a given word appears in the text
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1

    features = {}
    # Get rid of words that appear less than 3 times
    for word in tokens:
        if counts[word] >= 3 :
            features['count(%s)' % word] = counts[word]

    return features 

When all the above is done, I do the following:

train = getTrainingSet(n)

Where n is the number of sites I wish to train my model against.

Afterwards, I do:

feature_set = []
count = 0
for (site, category) in train:
    result = getSiteContent(site)
    count += 1
    if result != '':
        print "%d. Got content for %s" % (count, site)
        feature_set.append((result, category))
    else  :
        print "%d. Failed to get content for %s" % (count, site)

The print statements are mainly for debugging purposes at this time. After I do the above, feature_set contains something similar to the following:

print feature_set
[({u'count(import)': 22, u'count(maxim)': 22, u'count(Maxim)': 5, u'count(css)': 22, u'count(//www)': 22, u'count(;)': 22, u'count(url)': 22, u'count(Gift)': 3, u"count('')": 44, u'count(http)': 22, u'count(&)': 3, u'count(ng16ub)': 22, u'count(STYLEThe)': 3, u'count(com/modules/system/system)': 4, u'count(@)': 22, u'count(?)': 22}, 'Arts & Entertainment'), ({u'count(import)': 3, u'count(css)': 3, u'count(\u05d4\u05d9\u05d5\u05dd)': 4, u'count(\u05de\u05d9\u05dc\u05d5\u05df)': 6, u'count(;)': 3, u'count(\u05e2\u05d1\u05e8\u05d9)': 4, u'count(\u05d0\u05ea)': 3, u'count(\u05de\u05d5\u05e8\u05e4\u05d9\u05e7\u05e1)': 6, u"count('')": 6, u'count(\u05d4\u05d5\u05d0)': 3, u'count(\u05e8\u05d1\u05de\u05d9\u05dc\u05d9\u05dd)': 3, u'count(ver=01122014_4)': 3, u'count(|)': 4, u'count(``)': 4, u'count(@)': 3, u'count(?)': 7}, 'Miscellaneous')]

Afterwards, I try to train my classifier and then run it against the test data that I extract from feature_set:

train_set, test_set = feature_set[len(train)/2:], feature_set[:len(train)/2]
print "Num in train_set: %d" % len(train_set)
print "Num in test_set: %d" % len(test_set)
classifier = nltk.NaiveBayesClassifier.train(train_set) # <=== classifier trained on train_set
print classifier.show_most_informative_features(5)
print "=== Classifying a site ==="
print classifier.classify(getSiteContent(""))
print "Non-working sites: %d" % errors
print "Classifier accuracy: %d" % nltk.classify.accuracy(classifier, test_set)

This is pretty much exactly how the tutorial on the NLTK documentation website does it. However, the results are the following (given a set of 100 websites):

$ python
Num in train_set: 23
Num in test_set: 50
Most Informative Features
            count(Pizza) = None           Arts & : Techno =      1.0 : 1.0
=== Classifying a site ===
Technology & Computing
Non-working sites: 27
Classifier accuracy: 0

Now, there are obviously a few problems with this:

  1. The word tokens contain unicode characters such as \u05e2\u05d1\u05e8\u05d9, as it seems that the regex for removing them only works if they are standalone. This is a minor problem.

  2. A bigger problem is that even when I print the feature_set, the word tokens are displayed as u'count(...)' = # as opposed to 'count(...)' = #. I think this may be a bigger issue and part of why my classifier is failing.

  3. The classifier is, obviously, failing catastrophically at some point. The accuracy is listed as 0 even if I feed my entire dataset into the classifier, which seems extremely unlikely.

  4. The Most Informative Features function says that count(Pizza) = None. The code where I declare defaultdict(int), however, requires that every entry be associated with the number of appearances in the text.

I am at quite a loss as to why this happens. As far as I can tell, my data is structured identically to the data that the NLTK documentation uses in its tutorial on the website I linked at the top of this question. If anyone who has worked with NLTK has seen this behaviour before, I would greatly appreciate any tips as to what I could be doing wrong.


There are probably many errors here, but the first and most obvious one is this:

The accuracy is listed as 0 even if I feed my entire dataset into the classifier

It's not listed as 0.0? It sounds like something in there that ought to be a float is an int. I suspect you're doing division at some point for a normalization, and the int/int isn't getting converted into float.
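To illustrate the int-versus-float point, here is a minimal sketch. The numbers are made up for illustration and are not taken from the question's data. In Python 2, `/` between two ints floors the result; the sketch reproduces that with `//` so it behaves the same on Python 2 and 3. It also shows that the `%d` specifier used in the question's final print statement truncates a float, so any accuracy below 1.0 would be displayed as 0 regardless of how it was computed.

```python
# Hypothetical numbers, chosen only to illustrate the behaviour.
correct, total = 3, 4

# Python 2 evaluates int / int as floor division; `//` does the same on 2 and 3.
floored = correct // total
# Converting one operand to float gives the true ratio.
ratio = float(correct) / total

# %d truncates a float to its integer part; %.2f keeps the fraction.
as_int = "Classifier accuracy: %d" % ratio
as_float = "Classifier accuracy: %.2f" % ratio

print(floored)
print(ratio)
print(as_int)
print(as_float)
```

Printing the accuracy with `%.2f` (or `%f`) instead of `%d` is a quick way to see whether the value is really zero or merely displayed as zero.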

While building your count table, add 1.0 for each count, not 1. That should fix the source of the problem, and the corrections will trickle down.
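A minimal sketch of that change, assuming the same 'count(%s)' feature naming and the threshold of 3 from the question. The helper name `count_features` and the `min_count` parameter are made up for illustration:

```python
from collections import defaultdict

def count_features(tokens, min_count=3):
    """Return {'count(word)': float_count} for words seen at least min_count times."""
    counts = defaultdict(float)   # float default, so every count is a float
    for token in tokens:
        counts[token] += 1.0      # add 1.0 rather than 1
    features = {}
    for word, n in counts.items():
        if n >= min_count:
            features['count(%s)' % word] = n
    return features

features = count_features(['pizza', 'pizza', 'pizza', 'pasta'])
print(features)
```

Whether this changes the reported accuracy depends on where the integer arithmetic actually happens, so treat it as something to try rather than a guaranteed fix.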

If it seems strange to count documents with floats, think of each count as a measurement in the scientific sense of the word rather than a representation of a discrete document.
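As for the other errors: one that can be checked directly is that `re.sub` returns a new string rather than modifying its argument. The three `re.sub` calls in getSiteContent never assign their results, which would be consistent with digits and non-ASCII characters surviving into the tokens. A minimal sketch of keeping the results, using the digit and non-ASCII patterns from the question (the helper name `clean_text` and the sample string are assumptions for illustration):

```python
import re

def clean_text(text):
    """Strip digits and non-ASCII characters, keeping each re.sub result."""
    text = re.sub(r'\d+', '', text)              # remove digits
    text = re.sub(r'[^\x00-\x7F]', ' ', text)    # replace non-ASCII with a space
    return text

# Hypothetical sample mixing ASCII words, digits and Hebrew characters.
sample = u'Pizza 2014 \u05e2\u05d1\u05e8\u05d9 menu'
cleaned = clean_text(sample)
print(repr(cleaned))
```

The same pattern (`text = re.sub(...)`) would apply to the first substitution in the question as well.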