I am new to scikit-learn and at work I am working on a project involving multilabel classification of about 70,000 webpages (a ~250 MB file). Because of the size of the file, I have to use out-of-core classification. The labels for these pages are dmoz categories, so each page can have multiple labels.
I created the code below by adapting the out-of-core example from the scikit-learn documentation. However, the code below prints only one label for each document.
1) Is there some way I can print the top 5 labels for each document by probability? I would appreciate any pointers/modifications to the code.
2) What would be a good classifier that supports multilabel classification for this task, given that OneVsRestClassifier doesn't provide a partial_fit method?
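For question 2), the only direction I have found so far is to keep one binary online classifier per label myself. This is just a sketch under my assumptions (SGDClassifier because it exposes partial_fit; all_class_labels and the binarized y refer to my code below), not something I know to be the recommended way:

from sklearn.linear_model import SGDClassifier

# One binary online classifier per label (with dmoz there will be many).
per_label_clfs = {label: SGDClassifier(loss='log') for label in all_class_labels}

def partial_fit_multilabel(X_batch, y_batch_binary):
    # y_batch_binary is the MultiLabelBinarizer output: one 0/1 column per
    # label, in the same order as all_class_labels.
    for idx, label in enumerate(all_class_labels):
        per_label_clfs[label].partial_fit(X_batch, y_batch_binary[:, idx],
                                          classes=[0, 1])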
The text inside file_training_combined.csv looks like the following:
"http://home.earthlink.net/~rvbears/","RV Resources - Camping Information - RV Accessories","","","","","RV Resources - Camping Information - RV Accessories RV Resources\, Camping Resources\, Camping Information RV\, Camping Resources and Information! For Campers\, Travel Trailers\, Motorhome and Fifth Wheels Owners Camping Games Camping Recipes Camping Cooking Supplies RV Books RV E-Books RV Videos/DVD RV Links Looking for rv and camping information\, this is it! Check in here for lots of great resources and information especially for newbies. From Camping Gear\, to RV Books\, E-Books\, and Videos our pages are filled with information about everything to do with Camping and RVing to get you headed in the right direction\, from companies you can trust. Refer to the RV Links section for lots of camping gear and rv accessories\, find just about anything that you are looking for. Coming Back Soon....Our ""PRODUCT REVIEWS BLOG"" Will we be returning to reviewing our best bets on some of the newest camping gadgets for inside and outside your rv or tent. Emergency medical & travel assistance for less than 22 cents a day. Good Sam TravelAssist. Learn More! With over 2 million rescues and recoveries and counting\, Good Sam Roadside Assistance gives our members peace of mind when they travel. RV Accessories\, RV Decor\, RV Books\, RV E-books\, RV Videos\, RV DVDs RV Resources\, Camping Resources\, Camping Information NOTE: RV Ladders Bears are now SOLD OUT Home | Woodworking Links | Link To Us Copyright 2002-2014 GoCampin'. All Rights Reserved. Go Campin' ~ PO BOX 25417 ~ Greenville\, SC 29616-0417","/Top/Shopping/Crafts/Woodcraft/Decorative|/Top/Shopping/Crafts/Woodcraft/HomeDecor"
This is just one line from the CSV file. I am using the text in column 6, and the labels are in column 7, separated by |.
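To make the layout concrete, splitting the label field of the sample row above gives this (my reading of the format):

labels_field = "/Top/Shopping/Crafts/Woodcraft/Decorative|/Top/Shopping/Crafts/Woodcraft/HomeDecor"
print(labels_field.split('|'))
# ['/Top/Shopping/Crafts/Woodcraft/Decorative', '/Top/Shopping/Crafts/Woodcraft/HomeDecor']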
import codecs
import itertools
import time
import csv
import sys
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
__author__ = 'prateek.jain'
csv.field_size_limit(sys.maxsize)
sep = b","
quote_char = b'"'
stop = stopwords.words('english')
porter = PorterStemmer()
text_rows = []
text_labels = []
training_file_object = codecs.open('file_training_combined.csv','r', 'utf-8')
wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)
output_file = 'output.csv'
output_file_object = open(output_file, 'w')
# First pass over the file: collect texts and their labels, skipping
# label fields that are really URLs.
for row in wr1:
    text_rows.append(row[6])
    labels = row[7].strip().split('|')
    filtered_labels = []
    for label in labels:
        if not ('http:' in label.lower() or 'www:' in label.lower()):
            filtered_labels.append(label)
    text_labels.append(filtered_labels)
def tokenizer(text):
    # Strip HTML tags, pull out emoticons, then lowercase, remove
    # non-word characters, drop stopwords and stem.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    words = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in words]
    return tokenized
def stream_docs(path):
    # Stream (text, labels) pairs one row at a time so the whole file
    # never has to sit in memory.
    training_file_object = codecs.open(path, 'r', 'utf-8')
    wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)
    print(next(wr1))  # prints (and skips) the first row
    for row in wr1:
        text, label = row[6], row[7]
        labels = label.split('|')
        filtered_labels = []
        for label in labels:
            if not ('http:' in label.lower() or 'www:' in label.lower()):
                filtered_labels.append(label)
        yield text, filtered_labels
def get_minibatch(doc_stream, size):
    # Pull `size` documents from the generator to form one mini-batch.
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 10,
                         preprocessor=None,
                         lowercase=True,
                         tokenizer=tokenizer,
                         non_negative=True)
clf = MultinomialNB()
doc_stream = stream_docs(path='file_training_combined.csv')
# Build the full set of class labels seen anywhere in the training file,
# so the binarizer and the classifier agree on one fixed label space.
merged = list(itertools.chain(*text_labels))
all_class_labels = np.array(sorted(set(merged)))
mlb = MultiLabelBinarizer(classes=all_class_labels)
# Hold out the first 1,000 documents from the stream as a test set.
X_test_text, y_test = get_minibatch(doc_stream, 1000)
X_test = vect.transform(X_test_text)
tick = time.time()
accuracy = 0
total_fit_time = 0
n_train_pos = 0
# Train on 45 mini-batches of 1,000 documents each.
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train_matrix = vect.fit_transform(X_train)
    y_train = mlb.fit_transform(y_train)
    print('%s %s' % (X_train_matrix.shape, y_train.shape))
    clf.partial_fit(X_train_matrix.toarray(), y_train, classes=all_class_labels)
    total_fit_time += time.time() - tick
    n_train = X_train_matrix.shape[0]
    n_train_pos += sum(y_train)
    tick = time.time()

# Predict on the held-out test batch and pair each *test* document
# with its predicted labels.
predicted = clf.predict(X_test)
for item, labels in zip(X_test_text, predicted):
    print('%s => %s' % (item, labels))
    output_file_object.write('%s => %s\n' % (item, labels))
output_file_object.close()
With only 250 MB there is really no reason to go out of core. Or do you have less than 250 MB of RAM? For getting the top k predictions, you can use predict_proba or decision_function to find how likely each label is.
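A minimal sketch of that suggestion, assuming a fitted classifier clf that exposes predict_proba, and reusing X_test, X_test_text and all_class_labels from the question (untested against this exact setup):

import numpy as np

# Per-class probabilities for every test document; shape (n_docs, n_classes).
probabilities = clf.predict_proba(X_test)

# Indices of the 5 most probable classes per document, most likely first.
top5 = np.argsort(probabilities, axis=1)[:, ::-1][:, :5]

for doc_text, idx in zip(X_test_text, top5):
    print('%s => %s' % (doc_text[:60], list(all_class_labels[idx])))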