Dealing with dimensions in scikit-learn tree.DecisionTreeClassifier


I am trying to fit a decision tree using scikit-learn with three-dimensional training data and two-dimensional target data. As a simple example, imagine an RGB image. Let's say my target data is 1s and 0s, where a 1 represents the presence of a human face and a 0 represents its absence. Take for example:

red         green        blue        face presence  

1000        0001         0011        0000    
0110        0110         0001        0110    
0110        0110         0000        0110     

The 3D array of RGB data would be the training data, and the 2D array would be my target classes (face, no face).

In Python these arrays may look like:

rgb = np.array([[[1,0,0,0],[0,1,1,0],[0,1,1,0]],
               [[0,0,0,1],[0,1,1,0],[0,1,1,0]],
               [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])

face = np.array([[0,0,0,0],[0,1,1,0],[0,1,1,0]])

Unfortunately, this doesn't work:

import numpy as np
from sklearn import tree
dt_clf = tree.DecisionTreeClassifier()
dt_clf = dt_clf.fit(rgb, face)

This raises the following error:

Found array with dim 3. Expected <= 2

I have tried reshaping and flattening the data in several ways, but then I get a different error:

Number of labels=xxx does not match number of samples

Does anyone know how I can use tree.DecisionTreeClassifier to accomplish this? Thanks.

There is 1 answer

user14241 (best answer)

I think I have figured it out. It's not very pretty, so maybe someone can help clean up the code. Basically, I needed to reorganize the rgb data into an array of 12 three-element arrays, one [red, green, blue] row per pixel, i.e. shape=(12,3). For example:

np.hsplit(np.dstack(rgb).flatten(), len(face.flatten()))
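
As a possible cleanup, the same (12, 3) layout can probably be obtained with a plain axis move and reshape instead of dstack/flatten/hsplit. This is only a sketch using the arrays from the question; the names pixels_via_hsplit and pixels_via_reshape are made up here for the comparison.

```python
import numpy as np

rgb = np.array([[[1,0,0,0],[0,1,1,0],[0,1,1,0]],
                [[0,0,0,1],[0,1,1,0],[0,1,1,0]],
                [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])
face = np.array([[0,0,0,0],[0,1,1,0],[0,1,1,0]])

# Approach from the answer: stack the channels along the last axis,
# flatten, then split into one 3-element chunk per pixel.
pixels_via_hsplit = np.array(np.hsplit(np.dstack(rgb).flatten(), face.size))

# Alternative: move the channel axis to the end, then reshape to
# (n_pixels, n_channels). Each row is one pixel's [red, green, blue].
pixels_via_reshape = np.moveaxis(rgb, 0, -1).reshape(-1, rgb.shape[0])

print(pixels_via_reshape.shape)
print(np.array_equal(pixels_via_hsplit, pixels_via_reshape))
```

Either way, the point is that the classifier sees one row per pixel and one column per channel.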

I also flatten the face data so there is exactly one label per pixel (12 labels for 12 samples), so my final fit call becomes...

dt_clf = dt_clf.fit(np.hsplit(np.dstack(rgb).flatten(), len(face.flatten())), 
                    face.flatten())

Now I can test a new dataset and see if it works. In the training data, a face is present exactly where both the red and green values are 1, so a good test might be...

red         green        blue 

1100        1100         0011  
1100        1100         0001  
0000        0000         0000

or...

predict = np.array([[[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])

so...

predicted = dt_clf.predict(np.hsplit(np.dstack(predict).flatten(),
                           len(face.flatten())))

and to get it back in the proper dimensions...

predicted = np.array(np.hsplit(predicted, face.shape[0]))
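
If it helps, restoring the image shape can likely be done with a single reshape rather than hsplit. The sketch below uses a stand-in vector named flat (an assumption, standing in for the flat output of predict) and a hard-coded image_shape equal to face.shape from the example.

```python
import numpy as np

# Stand-in for the flat output of dt_clf.predict(...): one label per pixel.
flat = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
image_shape = (3, 4)  # assumed to match face.shape in the example

# Approach from the answer: split into one chunk per image row.
via_hsplit = np.array(np.hsplit(flat, image_shape[0]))

# Alternative: reshape directly to (height, width).
via_reshape = flat.reshape(image_shape)

print(np.array_equal(via_hsplit, via_reshape))
```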

which yields:

array([[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0]])

Wonderful! Now to see if this works on something bigger. Please feel free to offer suggestions to make this cleaner.
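
One way this might be tidied up, as a rough sketch rather than a definitive version: wrap the reshaping in a small helper and reshape the prediction back at the end. The helper name to_pixel_rows is hypothetical, and random_state=0 is only added here to make runs repeatable; the arrays are the ones used above.

```python
import numpy as np
from sklearn import tree

def to_pixel_rows(channels):
    """Turn a (n_channels, height, width) array into (height*width, n_channels)."""
    channels = np.asarray(channels)
    return np.moveaxis(channels, 0, -1).reshape(-1, channels.shape[0])

rgb = np.array([[[1,0,0,0],[0,1,1,0],[0,1,1,0]],
                [[0,0,0,1],[0,1,1,0],[0,1,1,0]],
                [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])
face = np.array([[0,0,0,0],[0,1,1,0],[0,1,1,0]])
predict = np.array([[[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[1,1,0,0],[1,1,0,0],[0,0,0,0]],
                    [[0,0,1,1],[0,0,0,1],[0,0,0,0]]])

# X has shape (12, 3): one sample per pixel, one feature per channel.
# y has shape (12,): one label per pixel.
dt_clf = tree.DecisionTreeClassifier(random_state=0)
dt_clf.fit(to_pixel_rows(rgb), face.ravel())

# Predict per pixel, then restore the (height, width) layout of the test image.
predicted = dt_clf.predict(to_pixel_rows(predict)).reshape(predict.shape[1:])
print(predicted)
```

Using predict.shape[1:] for the final reshape means the test image does not have to be the same size as the training image.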