What are the steps needed to use the Mahout Naive Bayes classifier algorithm?


I am trying to use a Naive Bayes classifier to detect fraudulent transactions. I have sample data of around 5000 records in an Excel sheet, which I will use to train the classifier, and test data of around 1000 records on which I will test the trained classifier.

My problem is that I don't know how to train the classifier. Do I need to transform my training data into some specific format before passing it to the trainer? How will the trainer know which column is my target value and which columns are its features?

Can someone please help me?


There is 1 answer

Answer by Next Door Engineer:

Before you can train and test a classifier, you need to make sure your training set has labels, or has been divided into chunks based on features you recorded during data collection. I am unsure how you have organized your data, but you need to split your data set into chunks so that records with similar features are grouped together.
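As a rough illustration of the splitting step, here is a minimal sketch that groups rows by label. It assumes the spreadsheet has been exported to CSV with a header row and the label in the last column; the file name transactions.csv, the by-label directory, the column names, and the sample rows are all hypothetical.

```shell
#!/bin/sh
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Hypothetical sample data: a CSV export with the label in the last column.
cat > transactions.csv <<'EOF'
amount,country,channel,label
120.50,US,web,legit
9800.00,NG,phone,fraud
45.10,US,store,legit
EOF

mkdir -p by-label

# Skip the header row, then write each row to a file named after its label.
awk -F',' 'NR > 1 { print > ("by-label/" $NF ".csv") }' transactions.csv

# Show how many rows ended up in each chunk.
wc -l by-label/*.csv
```

This simple approach breaks if a field itself contains a comma, so a real export may need a proper CSV parser.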

Once you have created your splits based on your criteria, check that your input data was created correctly. You can verify that the files exist using:

hadoop fs -ls filename

Train your classifier using:

$MAHOUT_HOME/bin/mahout trainclassifier -i input_file -o output_model

Test the classifier (here input_file should point to your test data rather than the training data) using:

$MAHOUT_HOME/bin/mahout testclassifier -m output_model -d input_file 

NOTE: During data collection, make sure you assign weights to certain data values, if such weights exist. The data also has to be cleaned to normalize out errors introduced by the experimental setup or the data collection. You can use a multiplicative scatter correction technique to correct your data set.

First, create a file called training-categories.txt that contains the categories for your classifier. You can use a simple text editor to do this.

Now that we have a list of categories we’re interested in, run the ExtractTrainingData class using the category list.

$TT_HOME/bin/tt extractTrainingData \
--dir ./index \
--categories ./training-categories.txt \
--output ./category-bayes-data \
--category-fields categoryFacet,source \
--text-fields title,description \
--tv

This command will read documents and search for matching categories in the category and source fields. When one of the categories listed in training-categories.txt is found in one of these documents, the terms will be extracted from term vectors stored in the title and description fields. These terms will be written to a file in the category-bayes-data directory. There will be a single file for each category. Each is a plain text file that can be viewed with any text editor or display utility.

The category name appears in the first column, while the terms that appear in the document are contained in the second column. The Mahout Bayes classifiers expect the input fields to be stemmed, so you will see this reflected in the test data. The --tv argument to the extractTrainingData command causes the stemmed terms from each document’s term vector to be used.
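For tabular data such as transactions, the same two-column layout can be produced by hand. The following is a minimal sketch, not the ExtractTrainingData tool itself. It assumes the two columns are separated by a tab, that the label is the last CSV column, and that each feature is encoded as a name_value token; the file names and sample rows are hypothetical.

```shell
#!/bin/sh
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Hypothetical sample data: a CSV export with the label in the last column.
cat > transactions.csv <<'EOF'
amount,country,channel,label
120.50,US,web,legit
9800.00,NG,phone,fraud
EOF

# Emit one line per record: label, a tab, then space-separated feature tokens.
awk -F',' '
NR == 1 { for (i = 1; i < NF; i++) name[i] = $i; next }  # remember feature names
{
    line = $NF "\t"                                       # label goes first
    for (i = 1; i < NF; i++)
        line = line (i > 1 ? " " : "") name[i] "_" $i     # e.g. country_US
    print line
}' transactions.csv > train.tsv

cat train.tsv
```

Continuous values such as the amount would probably work better if bucketed into ranges (for example amount_low, amount_high) before being turned into tokens, since each distinct token is treated as a separate term.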

When the ExtractTrainingData class has completed its run, it will output a count of the documents found in each category.
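If the training file was prepared by hand instead, a similar per-category count can be obtained with standard tools. This is a minimal sketch, assuming a tab-separated file with the category in the first column; the file name train.tsv and its contents are hypothetical.

```shell
#!/bin/sh
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Hypothetical training file: category, tab, then the terms.
printf 'legit\tamount_low country_US\n'  >  train.tsv
printf 'fraud\tamount_high country_NG\n' >> train.tsv
printf 'legit\tamount_low country_US\n'  >> train.tsv

# Count the number of lines (documents) per category, sorted by category name.
counts=$(awk -F'\t' '{ count[$1]++ } END { for (c in count) print c, count[c] }' train.tsv | sort)
echo "$counts"
```

For fraud data this is a quick way to see how unbalanced the categories are before training.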