Python sklearn cross_validation: number of labels does not match number of samples


I'm taking a course on machine learning and want to split my data into training and test sets, train a DecisionTreeClassifier on the training set, and then print the score on the test set. The cross-validation parameters in my code (random_state=42, test_size=0.3) were given. Does anyone see what I did wrong?

The error I get is the following:

Traceback (most recent call last):
  File "/home/stephan/ud120-projects/validation/validate_poi.py", line 36, in <module>
    clf = clf.fit(features_train, labels_train)
  File "/home/stephan/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 221, in fit
    "number of samples=%d" % (len(y), n_samples))
ValueError: Number of labels=29 does not match number of samples=66

Here is my code:

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "r") )

features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

from sklearn import tree
from sklearn import cross_validation

features_train, labels_train, features_test, labels_test = \
    cross_validation.train_test_split(features, labels, random_state=42, test_size=0.3)



clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
print clf.score(features_test, labels_test)

There are 2 answers

Alexander (best answer)

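Your variables don't match the order in which train_test_split returns its results. For each array you pass in, it returns the train split followed by the test split, so for (features, labels) the order is features_train, features_test, labels_train, labels_test. In your code the name labels_train actually holds the 29 test features, which is why fit complains that 29 labels do not match 66 samples.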

Try:

features_train, features_test, labels_train, labels_test = ...
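
To make the ordering concrete, here is a minimal sketch rather than the course's actual script. It assumes a recent scikit-learn, where train_test_split is imported from sklearn.model_selection instead of the older sklearn.cross_validation module used in the question, and it swaps in synthetic data for the pickled dataset. The 95 samples are an assumption chosen so the split sizes match the 29 and 66 in the traceback.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the course data (assumption): 95 samples, one feature each.
features = [[float(i)] for i in range(95)]
labels = [1.0 if i % 5 == 0 else 0.0 for i in range(95)]

# Return order: train/test pair for the first array, then train/test pair for the second.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

# Features and labels now agree in length within each split.
print(len(features_train), len(labels_train))
print(len(features_test), len(labels_test))

clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))
```

Printing the lengths before calling fit is a quick way to catch this kind of mix-up: each features list should be the same length as the labels list it is paired with.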
Aarif1430

You need to pass test_size=0.5 to the train_test_split function:

train_test_split(..., test_size=0.5, ...)