I have a Python 2.7 project where I want to classify websites based on their content. I have a database with numerous website URLs and their associated categories. There are many categories (= labels), and I wish to classify new sites into the corresponding category based on their content. I've been following the NLTK classification tutorial/example listed here, but have run into some problems I cannot explain.
Here is an outline of the process that I use:
- Use MySQLdb to retrieve the category associated with a given website URL. This will be used when extracting data (content) from the URL to pair it with the category (= label) of the site.
- Use a getSiteContent(site) function to extract the content from a website
The above function looks like this:
import re
import urllib2
from collections import defaultdict
from bs4 import BeautifulSoup
from nltk import word_tokenize
from nltk.corpus import stopwords

def getSiteContent(site):
    try:
        response = urllib2.urlopen(site, timeout=1)
        htmlSource = response.read()
    except Exception as e:  # <=== some websites may be inaccessible as list isn't up-to-date
        global errors
        errors += 1
        return ''

    soup = BeautifulSoup(htmlSource)
    for script in soup.find_all('script'):
        script.extract()

    commonWords = set(stopwords.words('english'))
    commonWords.update(['function', 'document', 'window', 'functions', 'getElementsByTagName',
                        'parentNode', 'getDocumentById', 'javascript', 'createElement',
                        'Copyright', 'Copyrights', 'respective', 'owners', 'Contact Us',
                        'Mobile Version', 'FAQ', 'Privacy Policy', 'Terms of Service',
                        'Legal Disclaimer'])

    text = soup.get_text()

    # Remove ',', '/', '%', ':'
    re.sub(r'(\d+[,/%:]?\d*)', '', text)
    # Remove digits
    re.sub(r'\d+', '', text)
    # Remove non-ASCII
    re.sub(r'[^\x00-\x7F]', ' ', text)

    # Remove stopwords
    for word in commonWords:
        text = text.replace(' ' + word + ' ', ' ')

    # Tokenize the site content using NLTK
    tokens = word_tokenize(text)

    # We collect some word statistics, i.e. how many times a given word appears in the text
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1

    features = {}
    # Get rid of words that appear less than 3 times
    for word in tokens:
        if counts[word] >= 3:
            features['count(%s)' % word] = counts[word]

    return features
When all the above is done, I do the following:
train = getTrainingSet(n)
random.shuffle(train)
Where n is the number of sites I wish to train my model against.
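getTrainingSet is just the MySQLdb step from the outline above: it pulls n (site, category) pairs out of the database. A simplified version of it, with placeholder connection details and a placeholder sites(url, category) table rather than my real schema, looks roughly like this:

import MySQLdb

def getTrainingSet(n):
    # Placeholder credentials and schema -- the real table/column names differ
    db = MySQLdb.connect(host="localhost", user="user",
                         passwd="password", db="websites")
    cursor = db.cursor()
    cursor.execute("SELECT url, category FROM sites LIMIT %s", (n,))
    rows = cursor.fetchall()
    db.close()
    # Return a list of (site, category) tuples, shuffled and iterated over below
    return [(url, category) for (url, category) in rows]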
Afterwards, I do:
feature_set = []
count = 0
for (site, category) in train:
    result = getSiteContent(site)
    count += 1
    if result != '':
        print "%d. Got content for %s" % (count, site)
        feature_set.append((result, category))
    else:
        print "%d. Failed to get content for %s" % (count, site)
The print statements are mainly for debugging purposes at this time. After I do the above, feature_set
contains something similar to the following:
print feature_set
[({u'count(import)': 22, u'count(maxim)': 22, u'count(Maxim)': 5, u'count(css)': 22, u'count(//www)': 22, u'count(;)': 22, u'count(url)': 22, u'count(Gift)': 3, u"count('')": 44, u'count(http)': 22, u'count(&)': 3, u'count(ng16ub)': 22, u'count(STYLEThe)': 3, u'count(com/modules/system/system)': 4, u'count(@)': 22, u'count(?)': 22}, 'Arts & Entertainment'), ({u'count(import)': 3, u'count(css)': 3, u'count(\u05d4\u05d9\u05d5\u05dd)': 4, u'count(\u05de\u05d9\u05dc\u05d5\u05df)': 6, u'count(;)': 3, u'count(\u05e2\u05d1\u05e8\u05d9)': 4, u'count(\u05d0\u05ea)': 3, u'count(\u05de\u05d5\u05e8\u05e4\u05d9\u05e7\u05e1)': 6, u"count('')": 6, u'count(\u05d4\u05d5\u05d0)': 3, u'count(\u05e8\u05d1\u05de\u05d9\u05dc\u05d9\u05dd)': 3, u'count(ver=01122014_4)': 3, u'count(|)': 4, u'count(``)': 4, u'count(@)': 3, u'count(?)': 7}, 'Miscellaneous')]
Afterwards, I try to train my classifier and then run it against the test data that I extract from feature_set:
train_set, test_set = feature_set[len(train)/2:], feature_set[:len(train)/2]
print "Num in train_set: %d" % len(train_set)
print "Num in test_set: %d" % len(test_set)
classifier = nltk.NaiveBayesClassifier.train(train_set) # <=== classifier declared on train_set
print classifier.show_most_informative_features(5)
print "=== Classifying a site ==="
print classifier.classify(getSiteContent("http://www.mangaspoiler.com"))
print "Non-working sites: %d" % errors
print "Classifier accuracy: %d" % nltk.classify.accuracy(classifier, test_set)
This is pretty much exactly how the tutorial on the NLTK documentation website does it. However, the results are the following (given a set of 100 websites):
$ python classify.py
Num in train_set: 23
Num in test_set: 50
Most Informative Features
count(Pizza) = None Arts & : Techno = 1.0 : 1.0
None
=== Classifying a site ===
Technology & Computing
Non-working sites: 27
Classifier accuracy: 0
Now, there are obviously a few problems with this:
- The word tokens contain unicode characters such as \u05e2\u05d1\u05e8\u05d9, as it seems that the regex for removing them only works if they are standalone. This is a minor problem.
- A bigger problem is that even when I print the feature_set, the word tokens are displayed as u'count(...)' = # as opposed to 'count(...)' = #. I think this may be a bigger issue and part of why my classifier is failing.
- The classifier is, obviously, failing catastrophically at some point. The accuracy is listed as 0 even if I feed my entire dataset into the classifier, which seems extremely unlikely.
- The Most Informative Features function says that count(Pizza) = None. The code where I declare defaultdict(int), however, requires that every entry be associated with the number of appearances in the text (a toy run of that counting logic is shown right below this list).
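Here is that counting logic run by hand on a toy token list, just to show what the feature dicts actually hold; every key that is present maps to an integer count of at least 3, and low-frequency words get no key at all:

from collections import defaultdict

tokens = ['Pizza', 'Pizza', 'Pizza', 'menu']

counts = defaultdict(int)
for token in tokens:
    counts[token] += 1

features = {}
for word in tokens:
    if counts[word] >= 3:
        features['count(%s)' % word] = counts[word]

print features  # {'count(Pizza)': 3}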
I am at quite a loss as to why this happens. As far as I can tell, my data is structured identically to the data that the NLTK documentation uses in its tutorial on the website I linked at the top of this question. If anyone who has worked with NLTK has seen this behaviour before, I would greatly appreciate any tips as to what I could be doing wrong.
There are probably many errors here, but the first and most obvious one stands out:

It's not listed as 0.0? It sounds like something in there that ought to be a float is an int. I suspect you're doing division at some point for a normalization, and the int/int result isn't getting converted into a float.

While building your count table, add 1.0 for each count, not 1. That will fix the source of the problem, and the corrections will trickle down.

If it seems strange to count documents with floats, think of each count as a measurement in the scientific sense of the word rather than a representation of a discrete document.
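Concretely, the change amounts to something like the following standalone sketch of just the counting step (count_features here is a made-up helper, not your actual function):

from collections import defaultdict

def count_features(tokens):
    # Accumulate float counts instead of defaultdict(int)
    counts = defaultdict(float)
    for token in tokens:
        counts[token] += 1.0  # add 1.0 per occurrence, not 1

    # Keep only words that appear at least 3 times, mirroring your original filter
    return dict(('count(%s)' % word, counts[word])
                for word in counts if counts[word] >= 3)

print count_features(['css', 'css', 'css', 'import'])
# {'count(css)': 3.0}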