How does one interpret the random forest classifier from scikit-learn?

I know a little about how random forests work. Usually in classification I fit the training data to the random forest classifier and ask it to predict the test data.

Currently I am working on the Titanic data that was provided to me. These are the top rows of the data set; there are roughly 1300 rows in total.

    survived  pclass  sex     age     sibsp  parch  fare      embarked
0   1         1       female  29      0      0      211.3375  S
1   1         1       male    0.9167  1      2      151.55    S
2   0         1       female  2       1      2      151.55    S
3   0         1       male    30      1      2      151.55    S
4   0         1       female  25      1      2      151.55    S
5   1         1       male    48      0      0      26.55     S
6   1         1       female  63      1      0      77.9583   S
7   0         1       male    39      0      0      0         S
8   1         1       female  53      2      0      51.4792   S
9   0         1       male    71      0      0      49.5042   C
10  0         1       male    47      1      0      227.525   C
11  1         1       female  18      1      0      227.525   C
12  1         1       female  24      0      0      69.3      C
13  1         1       female  26      0      0      78.85     S

No test data is given, so I want the random forest to predict survival on the entire data set and compare it with the actual values (more like checking the accuracy score).

So what I have done is divide my complete dataset into two parts: one with the features and one with the target (survived). The features consist of all the columns except survived, and the target is the survived column.

dfFeatures = dfCopy.drop('survived', axis=1)
dfTarget = df['survived']

Note: df is the entire dataset and dfCopy is a copy of it.

Here is the code that checks the score of the random forest:

rfClf = RandomForestClassifier(n_estimators=100, max_features=10)
rfClf = rfClf.fit(dfFeatures, dfTarget)
scoreForRf = rfClf.score(dfFeatures, dfTarget)

I get a score output something like this:

The accuracy score for random forest is :  0.983193277311

I am finding it a little difficult to understand what is happening behind the scenes in the code above.

Does it predict survival for all the rows based on the features (dfFeatures), compare the predictions with the actual values (dfTarget) and give the prediction score, or does it randomly create train and test sets from the data provided and report the accuracy on a test set it generated behind the scenes?

To be more precise: while calculating the accuracy score, does it predict survival for the entire data set or just for a random subset of it?

1 Answer

Answered by Po Stevanus Andrianta

Somehow I don't see you trying to split the dataset into a train set and a test set.

dfTarget = df['survived']

dfTarget contains only the survived column, i.e. the labels.

dfFeatures = dfCopy.drop('survived', axis=1)

dfFeatures contains all the features (pclass, sex, age, etc.).

Now, jumping to the code:

rfClf = RandomForestClassifier(n_estimators=100, max_features=10)

The line above creates the random forest classifier. n_estimators is the number of trees in the forest, not the depth of a tree; adding more trees mostly costs training time, while tree depth (controlled by max_depth) is what usually drives overfitting. max_features is the number of features considered when looking for the best split.
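
For reference, a rough sketch of how those two knobs differ; the values below are only illustrative, not tuned for this data:

from sklearn.ensemble import RandomForestClassifier

# n_estimators: how many trees are grown and averaged; more trees mainly
# costs time and tends to stabilise the predictions rather than overfit.
# max_depth: how deep each individual tree may grow; very deep trees are
# the usual source of overfitting on a small dataset like this one.
rfClf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)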

rfClf = rfClf.fit(dfFeatures, dfTarget)

The line above is the training step. .fit() needs two parameters: the first is the feature matrix and the second is the labels (the target values from the 'survived' column).

scoreForRf = rfClf.score(dfFeatures, dfTarget)

.score() also takes two parameters: the features first and the labels second. It uses the model fitted by .fit() to predict a label for every row of the first parameter and then compares those predictions against the second parameter, returning the fraction that match (the accuracy).
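
Concretely, for a classifier .score() is just accuracy computed on whatever data you pass in, so it behaves roughly like this (a sketch, assuming rfClf has already been fitted as above):

from sklearn.metrics import accuracy_score

# rfClf.score(dfFeatures, dfTarget) is equivalent to:
predictions = rfClf.predict(dfFeatures)        # predict a label for every row passed in
score = accuracy_score(dfTarget, predictions)  # fraction of predictions matching the labels
print(score)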

From what I see, you are using the same data to train and to score the model, which is not good; the ~0.98 you got is training accuracy, so it says little about how the model would do on unseen passengers.

"To be more precise: while calculating the accuracy score, does it predict survival for the entire data set or just for a random subset of it?"

You used all of the data, both to fit and to score the model. .score() does not create any random train/test split behind the scenes; it simply predicts every row you give it and compares against the labels.
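
A minimal sketch of what a proper hold-out split could look like, assuming df is the full DataFrame and the categorical columns (sex, embarked) have already been encoded as numbers; the 80/20 split and the random_state are arbitrary choices:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = df.drop('survived', axis=1)   # feature matrix
y = df['survived']                # labels

# Keep 20% of the rows aside; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rfClf = RandomForestClassifier(n_estimators=100)
rfClf = rfClf.fit(X_train, y_train)

# Accuracy on unseen rows, usually noticeably lower than the ~0.98
# obtained by scoring the same rows the forest was trained on.
print(rfClf.score(X_test, y_test))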

"I could use cross-validation, but then again the question is: do I have to for a random forest? Also, cross-validation for a random forest seems to be very slow."

Of course you need to use validation to test your model. Build a confusion matrix and look at precision and recall; don't just rely on the accuracy, as sketched below.
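
A sketch of both ideas, 5-fold cross-validation plus a confusion matrix and precision/recall report, assuming X and y are the encoded feature matrix and the survived labels from the sketch above (the fold count is an arbitrary choice):

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

rfClf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: fit 5 times, scoring each time on the held-out fold.
scores = cross_val_score(rfClf, X, y, cv=5)
print(scores.mean(), scores.std())

# Confusion matrix plus precision/recall on a single held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rfClf = rfClf.fit(X_train, y_train)
y_pred = rfClf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))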

If you think the model is running too slowly, decrease the n_estimators value; fewer trees means less work per fit.