How to solve a scikit-learn preprocessing Pipeline error with a NumPy array?


I am using scikit-learn to build a classifier that predicts if two sentences are paraphrases or not (e.g. paraphrases: How tall was Einstein vs. What was Albert Einstein's length).

My data consists of 2 columns with strings (phrase pairs) and 1 target column with 0's and 1's (= no paraphrase, paraphrase). I want to try different algorithms.

I expect the last line of code below to fit the model. Instead, the pre-processing Pipeline keeps producing an error I cannot solve: "AttributeError: 'numpy.ndarray' object has no attribute 'lower'."

The code is below and I have isolated the error happening in the last line shown (for brevity I have excluded the rest). I suspect it is because the target column contains 0s and 1s, which cannot be turned lowercase.

I have tried the answers to similar questions on Stack Overflow, but no luck so far.

How can you work around this?

question1                                          question2                                   is_paraphrase
How long was Einstein?                             How tall was Albert Einstein?               1
Does society place too much importance on sports?  How do sports contribute to the society?    0
What is a narcissistic personality disorder?       What is narcissistic personality disorder?  1

======

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

para = "paraphrases.tsv"

df = pd.read_csv(para, usecols = [3, 5], nrows = 100, header=0, sep="\t")

y = df["is_paraphrase"].values
X = df.drop("is_paraphrase", axis=1).values
X = X.astype(str) # I have tried this
X = np.char.lower(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

text_clf.fit(X_train, y_train)

There is 1 answer

Gambit1614 On

The error is not caused by the target column. It occurs because your training set X_train contains two columns, question1 and question2, so each row of X_train is an array of values rather than a single string. When CountVectorizer tries to convert a row to lowercase, it raises an error because a numpy.ndarray has no lower method.
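The failure can be reproduced on a tiny made-up array, which may help confirm the diagnosis. This is only a sketch: X_toy and its sentences are invented stand-ins for X_train, and the only assumption is that CountVectorizer is used with its default settings, as in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for X_train: two text columns per row.
X_toy = np.array([
    ["how long was einstein?", "how tall was albert einstein?"],
    ["what is a cat?", "how do sports contribute to society?"],
])

try:
    # Iterating over a 2-D array yields 1-D row arrays, not strings.
    CountVectorizer().fit(X_toy)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# A single column is a sequence of strings, so fitting on it succeeds.
vocab = CountVectorizer().fit(X_toy[:, 0]).vocabulary_
print(sorted(vocab))
```

The first call fails for the reason given above, while the second works because each item handed to the vectorizer is one string.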

To overcome this problem, split X_train into two parts, say X_train_pt1 and X_train_pt2, one per question column. Then apply CountVectorizer to each part individually, followed by TfidfTransformer on each individual result. Also make sure that you use the same fitted object to transform both parts.

Finally, stack these two arrays together and give the result to your classifier as input. You can find a similar implementation here.
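The steps described above can be sketched on toy data as follows. Everything here is illustrative: X_toy, y_toy and the sentences are made up, and fitting the shared vectorizer on both columns together is one reasonable reading of "use the same object", not the only one.

```python
import numpy as np
from scipy.sparse import hstack, vstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-ins for X_train and y_train.
X_toy = np.array([
    ["how long was einstein?", "how tall was albert einstein?"],
    ["does society overvalue sports?", "how do sports contribute to society?"],
    ["what is a narcissistic personality disorder?",
     "what is narcissistic personality disorder?"],
    ["how old is the moon?", "what is the best pizza topping?"],
])
y_toy = np.array([1, 0, 1, 0])

# 1. Split into one 1-D array of strings per question column.
q1, q2 = X_toy[:, 0], X_toy[:, 1]

# 2. Fit ONE CountVectorizer on all the text, then transform each part with it.
vect = CountVectorizer().fit(np.concatenate([q1, q2]))
counts_q1, counts_q2 = vect.transform(q1), vect.transform(q2)

# 3. Likewise fit ONE TfidfTransformer on all the counts.
tfidf = TfidfTransformer().fit(vstack([counts_q1, counts_q2]))

# 4. Stack the two parts side by side: one row per question pair.
features = hstack([tfidf.transform(counts_q1),
                   tfidf.transform(counts_q2)]).tocsr()

clf = MultinomialNB().fit(features, y_toy)
print(features.shape, clf.predict(features))
```

Stacking horizontally keeps one row per question pair, so the feature matrix still lines up with the target column. For a test set, reuse vect and tfidf with transform only.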

Update:
I think the following should be of some help (I admit this code could be made more efficient):

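from scipy.sparse import hstack, issparse, vstack


def flat_list(my_list):
    # Flatten a list of string columns into one flat list of strings.
    return [str(item) for sublist in my_list for item in sublist]


def transform_data(trans_obj_list, dataset_splits):
    # Break every split into its question columns, each a 1-D array of strings.
    splits = [[split[:, i].astype(str) for i in range(split.shape[1])]
              for split in dataset_splits]

    for trfs in trans_obj_list:
        # Fit each transformer once, on the training columns only. The first
        # step sees raw text; later steps see the previous step's sparse output.
        train_columns = splits[0]
        if issparse(train_columns[0]):
            fit_data = vstack(train_columns)
        else:
            fit_data = flat_list(train_columns)
        transformed_vector = trfs().fit(fit_data)
        # Reuse the same fitted object for every column of every split.
        splits = [[transformed_vector.transform(column) for column in columns]
                  for columns in splits]

    # Stack the columns side by side so each row is one question pair again.
    return [hstack(columns).tocsr() for columns in splits]


new_X_train, new_X_test = transform_data([CountVectorizer, TfidfTransformer], [X_train, X_test])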
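A different route, not the one this answer takes, is scikit-learn's ColumnTransformer, which can send each question column to its own TfidfVectorizer inside the Pipeline. That keeps the single fit call the question was aiming for. Two caveats: each column gets a separate vocabulary rather than one shared transformer, and TfidfVectorizer stands in for the CountVectorizer plus TfidfTransformer pair. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up stand-ins for X_train and y_train.
X_toy = np.array([
    ["how long was einstein?", "how tall was albert einstein?"],
    ["does society overvalue sports?", "how do sports contribute to society?"],
    ["what is a narcissistic personality disorder?",
     "what is narcissistic personality disorder?"],
    ["how old is the moon?", "what is the best pizza topping?"],
])
y_toy = np.array([1, 0, 1, 0])

# A scalar column index hands each vectorizer a 1-D array of strings.
features = ColumnTransformer([
    ("q1", TfidfVectorizer(), 0),
    ("q2", TfidfVectorizer(), 1),
])

text_clf = Pipeline([("features", features), ("clf", MultinomialNB())])
text_clf.fit(X_toy, y_toy)
predictions = text_clf.predict(X_toy)
print(predictions)
```

Because the vectorizers live inside the Pipeline, they are fitted on the training data only and reused automatically when predict is called on new pairs.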