Dealing with Sparse Matrices and multiple numerical features when training algorithm


I have a data frame that looks as follows:

                                    description      priority  CDT  JDT  
0  Create Help Index Fails with seemingly incorre...       P3    0    0       
1  Internal compiler error when compiling switch ...       P3    0    1       
2  Default text sizes in org.eclipse.jface.resour...       P3    0    0       
3  [Presentations] [ViewMgmt] Holding mouse down ...       P3    0    0       
4  Parsing of function declarations in stdio.h is...       P2    1    0       

PDE  Platform  Web Tools  priorityLevel  
0         0          0              2  
1         0          0              2  
2         1          0              2  
3         1          0              2  
4         0          0              1  

I am currently trying to train an ML algorithm that would take the text in 'description' along with the rest of the numerical features except for 'priority' (discarded) and 'priorityLevel' (true labels).

This is basically an NLP application. The issue I'm having is that 'description' must first go through a 'CountVectorizer()' function:

from sklearn.feature_extraction.text import CountVectorizer
X = df['description']
cv = CountVectorizer()
X = cv.fit_transform(X)

The output it returns is a sparse matrix, which is incompatible with the rest of the data frame when I go to pass it to the training algorithm.

I need to be able to combine X after it has been vectorized, along with df[['CDT', 'JDT', 'PDE', 'Platform', 'Web Tools']] into a single variable in order to split and train:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

nb = MultinomialNB()
nb.fit(X_train, y_train)

In essence, X should contain the vectorized text, along with the numerical variables. All efforts thus far have failed.

I have tried to do this through a pipeline as well:

pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('classifier', MultinomialNB()),
])

pipeline.fit(X_train,y_train)

But I get errors indicating that the sizes are incompatible.

Does anyone know of an easier way to combine the sparse matrix returned by the vectorizer with the numerical features so that I can train the algorithm?

All help is appreciated.

Edit:

I have trained this algorithm with no problems whatsoever using only the vectorized text. My issue arises when trying to incorporate additional features into the training set.

1 answer

Answer by AndyShan:

In your code, CountVectorizer() counts the word frequencies of the text.
But when you call it like this:

X = cv.fit_transform(X)

you get data of type <'scipy.sparse.csr.csr_matrix'> instead of <'numpy.ndarray'>, so combining it directly with the other columns can cause problems.
You can use this code to get data of type <'numpy.ndarray'>:

X = cv.fit_transform(X).toarray()

And the data looks like this:

print(X)
[[1 1 0 0 1]
 [1 0 0 1 1]
 [1 0 1 0 1]]
print(type(X))
<class 'numpy.ndarray'>
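If the vocabulary is large, converting to a dense array can use a lot of memory. One alternative, sketched here rather than taken from the original code, is to keep everything sparse and join the vectorized text to the numeric columns with scipy.sparse.hstack. The toy data frame below is made up for illustration and only mimics the column names from the question; the names X_text, X_num and numeric_cols are assumptions as well.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the question's data frame (values invented for illustration).
df = pd.DataFrame({
    "description": [
        "compiler error when compiling switch",
        "default text sizes in resources",
        "holding mouse down on view tab",
        "parsing of function declarations fails",
        "help index creation fails",
        "internal compiler error in parser",
        "view tab does not refresh",
        "function declarations are parsed twice",
    ],
    "CDT": [0, 0, 0, 1, 0, 1, 0, 1],
    "JDT": [1, 0, 0, 0, 0, 1, 0, 0],
    "PDE": [0, 0, 0, 0, 1, 0, 0, 0],
    "Platform": [0, 1, 1, 0, 0, 0, 1, 0],
    "Web Tools": [0, 0, 0, 0, 0, 0, 0, 0],
    "priorityLevel": [2, 2, 2, 1, 2, 1, 2, 1],
})
numeric_cols = ["CDT", "JDT", "PDE", "Platform", "Web Tools"]

# Vectorize the text; the result is a sparse matrix of shape (n_rows, n_vocab).
cv = CountVectorizer()
X_text = cv.fit_transform(df["description"])

# Wrap the numeric columns in a sparse matrix of shape (n_rows, 5).
X_num = csr_matrix(df[numeric_cols].values)

# Stack the two blocks column-wise; both must have the same number of rows.
X = hstack([X_text, X_num]).tocsr()
y = df["priorityLevel"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

nb = MultinomialNB()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
```

MultinomialNB accepts this sparse input directly as long as all values are non-negative, which holds for word counts and 0/1 flags. Note that this sketch fits the vectorizer before splitting, as in the question; fitting it on the training rows only would avoid leaking test vocabulary. scikit-learn's ColumnTransformer is another option if you want the whole thing inside a Pipeline.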