I currently have a CSV export from our ticketing system with two columns.
Short Description and Class.
Both are entered by the agent when logging a ticket, e.g.
- Data Backup is not working,Backup
- Email change in Groups,Notes
- backup directory not found,Backup
- Email > Global - Lotus Notes,Notes
I have been asked to write a Naive Bayes program in Python that reads the short descriptions from a CSV file and decides how each ticket should be classified.
I have 329 tickets that have been classified into 6 different classes.
The following is a count of each:
- Class1 60
- Class2 77
- Class3 65
- Class4 16
- Class5 18
- Class6 93
I was thinking I would have to create 6 different dictionaries (one for each class) containing all the words used in the short descriptions, excluding the usual punctuation: !"£$%^&*()<>,./?:;@'#~][{}
Then, when I run the program, it will tokenize the short description using NLTK and compare it against all the dictionaries; whichever dictionary has the most matches determines the class.
Am I going about this the right way? How many tickets should I be using for my sample?
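Roughly what I have in mind, sketched with made-up training pairs (I'm using `re.findall(r'\w+', ...)` here instead of NLTK's RegexpTokenizer just to keep the example self-contained; it matches the same `\w+` pattern):

```python
import re
from collections import Counter, defaultdict

# Made-up (short description, class) pairs standing in for my CSV rows.
train = [
    ("Data Backup is not working", "Backup"),
    ("backup directory not found", "Backup"),
    ("Email change in Groups", "Notes"),
    ("Email > Global - Lotus Notes", "Notes"),
]

# One word-frequency dictionary per class.
class_words = defaultdict(Counter)
for text, label in train:
    class_words[label].update(re.findall(r'\w+', text.lower()))

def classify(text):
    """Pick the class whose dictionary matches the most tokens."""
    tokens = re.findall(r'\w+', text.lower())
    scores = {label: sum(words[t] for t in tokens)
              for label, words in class_words.items()}
    return max(scores, key=scores.get)

print(classify("backup failed again"))  # "Backup"
```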
The following is what I have at the moment. It runs through a CSV file named after a class and outputs another file with punctuation removed and every word lower-cased in a separate cell. That output will then be used as a dictionary. I'm not sure if I'm going about this whole thing the right way, though.
import csv
from nltk.tokenize import RegexpTokenizer

# tokenizer that drops punctuation
tokenizer = RegexpTokenizer(r'\w+')

# read the CSV
readFile = open('Backup.csv', 'r')
reader = csv.reader(readFile)
resultFile = open('result.csv', 'w', newline='')
wr = csv.writer(resultFile)

# for every row in the file, tokenize the short description,
# convert it to lowercase, and write the words to a .csv file
for row in reader:
    wr.writerow(tokenizer.tokenize(row[0].lower()))

readFile.close()
resultFile.close()
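For comparison, here is my understanding of what the actual Naive Bayes calculation adds on top of this preprocessing: a log prior per class plus Laplace-smoothed word likelihoods, rather than raw match counts. The training pairs below are placeholders for my real data:

```python
import math
import re
from collections import Counter, defaultdict

# Placeholder (short description, class) pairs; the real ones come from my CSV.
train = [
    ("Data Backup is not working", "Backup"),
    ("backup directory not found", "Backup"),
    ("Email change in Groups", "Notes"),
    ("Email > Global - Lotus Notes", "Notes"),
]

word_counts = defaultdict(Counter)   # per-class word frequencies
class_counts = Counter()             # per-class ticket counts
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(re.findall(r'\w+', text.lower()))

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Score each class with log prior + smoothed log likelihoods."""
    tokens = re.findall(r'\w+', text.lower())
    best_label, best_score = None, float('-inf')
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[label] / len(train))
        for t in tokens:
            # Laplace smoothing so unseen words don't zero out a class.
            score += math.log((counts[t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("backup not working"))  # "Backup"
```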
EDIT: I have now started using the following, which takes in the data from my two-column CSV file:
from textblob.classifiers import NaiveBayesClassifier

with open('train.csv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")

print(cl.classify("backup"))       # "Backup"
print(cl.classify("Lotus Notes"))  # "Notes"
etc.
Pretty sure I just need a larger sample of training and test data; then I will feed in a CSV of short descriptions and update it with the class that has been calculated.
From a functionality point of view it seems to work, unless I've made any glaring mistakes?
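On the sample-size question, a common rule of thumb is to hold out roughly 20% of the labelled data for testing and train on the rest. For my 329 tickets that would look something like this (the ticket list here is a placeholder):

```python
import random

# Placeholder for my 329 labelled (description, class) pairs.
tickets = [("desc %d" % i, "Class%d" % (i % 6)) for i in range(329)]

random.seed(0)          # reproducible shuffle
random.shuffle(tickets)

split = int(len(tickets) * 0.8)   # 80% train, 20% test
train, test = tickets[:split], tickets[split:]
print(len(train), len(test))      # 263 66
```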