Dealing with Sparse Matrices and multiple numerical features when training algorithm

Question

516 views · Asked by Walter U. on 31 August 2017 at 22:47

I have a data frame that looks as follows:

                                    description      priority  CDT  JDT  
0  Create Help Index Fails with seemingly incorre...       P3    0    0       
1  Internal compiler error when compiling switch ...       P3    0    1       
2  Default text sizes in org.eclipse.jface.resour...       P3    0    0       
3  [Presentations] [ViewMgmt] Holding mouse down ...       P3    0    0       
4  Parsing of function declarations in stdio.h is...       P2    1    0       

PDE  Platform  Web Tools  priorityLevel  
0         0          0              2  
1         0          0              2  
2         1          0              2  
3         1          0              2  
4         0          0              1

I am currently trying to train an ML algorithm that would take the text in 'description' along with the rest of the numerical features except for 'priority' (discarded) and 'priorityLevel' (true labels).

This is basically an NLP application. The issue I'm having is that 'description' must first go through a 'CountVectorizer()' function:

X = df['description']
cv = CountVectorizer()
X = cv.fit_transform(X)

The sparse matrix it returns is incompatible with the rest of the data frame when I pass it to the training algorithm.

I need to be able to combine X after it has been vectorized, along with df[['CDT', 'JDT', 'PDE', 'Platform', 'Web Tools']] into a single variable in order to split and train:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

nb = MultinomialNB()
nb.fit(X_train, y_train)

In essence, X should contain the vectorized text, along with the numerical variables. All efforts thus far have failed.

I have also tried doing this through a pipeline:

pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('classifier', MultinomialNB()),
])

pipeline.fit(X_train,y_train)

But I get errors indicating that the sizes are incompatible.

Does anyone know of an easier way to combine the sparse matrix returned by the vectorizer with the numerical features so that I can train the algorithm?

All help is appreciated.

Edit:

I have trained this algorithm with no problems whatsoever using only the vectorized text. My issue arises when trying to incorporate additional features into the training set.

There is 1 answer

**AndyShan** · Answer 1 · 2017-09-01T02:02:32+00:00

Judging by your code, you are using CountVectorizer() to count the word frequencies in the text.
But when you call it like this:

X = cv.fit_transform(X)

you get data of type <'scipy.sparse.csr.csr_matrix'> instead of <'numpy.ndarray'>, so combining it with the other features may cause problems.
You can use this code to get data of type <'numpy.ndarray'> instead:

X = cv.fit_transform(X).toarray()
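Once the counts are a dense array, they can be joined column-wise with the numeric columns. The following is only a minimal sketch: the data frame is a made-up stand-in that borrows some column names from the question, and only two of the indicator columns are used to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the question's data frame (rows are invented).
df = pd.DataFrame({
    "description": [
        "compiler error in switch",
        "help index fails",
        "parsing of stdio fails",
    ],
    "CDT": [0, 0, 1],
    "JDT": [1, 0, 0],
    "priorityLevel": [2, 2, 1],
})

cv = CountVectorizer()
# Dense array of word counts, shape (n_rows, n_vocabulary_terms).
text_counts = cv.fit_transform(df["description"]).toarray()
# Dense array of the numeric indicator columns, shape (n_rows, 2).
numeric = df[["CDT", "JDT"]].to_numpy()

# Column-wise concatenation: word counts first, then the numeric columns.
X = np.hstack([text_counts, numeric])
y = df["priorityLevel"].to_numpy()
```

The combined X has one row per description and can be passed to train_test_split together with y. A dense array is convenient for small vocabularies, but its memory use grows with the number of rows times the vocabulary size.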

After the conversion, X looks like this:

print(X)
[[1 1 0 0 1]
 [1 0 0 1 1]
 [1 0 1 0 1]]
print(type(X))
<class 'numpy.ndarray'>
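Converting to a dense array can take a lot of memory when the vocabulary is large. An alternative, which is not part of the answer above, is to keep the counts sparse and append the numeric columns with scipy.sparse.hstack. This is a sketch under stated assumptions: the rows are invented, and only two of the question's indicator columns are used.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the question's data frame (rows are invented).
df = pd.DataFrame({
    "description": [
        "compiler error in switch",
        "help index fails",
        "parsing of stdio fails",
        "default text sizes wrong",
        "mouse down freezes view",
        "internal error when compiling",
    ],
    "CDT": [0, 0, 1, 0, 0, 1],
    "JDT": [1, 0, 0, 0, 0, 1],
    "priorityLevel": [2, 2, 1, 2, 1, 1],
})

cv = CountVectorizer()
text_counts = cv.fit_transform(df["description"])    # stays a sparse matrix
numeric = csr_matrix(df[["CDT", "JDT"]].to_numpy())  # numeric columns as sparse

# Sparse column-wise concatenation; CSR format supports row slicing for the split.
X = hstack([text_counts, numeric], format="csr")
y = df["priorityLevel"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# MultinomialNB accepts sparse input; all features here are non-negative counts.
nb = MultinomialNB()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
```

MultinomialNB expects non-negative features, which holds for word counts and 0/1 indicator columns. Numeric features that can be negative would need a different classifier or a rescaling step.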
