I am using scikit-learn to build a classifier that predicts if two sentences are paraphrases or not (e.g. paraphrases: How tall was Einstein vs. What was Albert Einstein's length).
My data consists of 2 columns with strings (phrase pairs) and 1 target column with 0's and 1's (= no paraphrase, paraphrase). I want to try different algorithms.
I expect the last line of code below to fit the model. Instead, the pre-processing Pipeline keeps producing an error I cannot solve: "AttributeError: 'numpy.ndarray' object has no attribute 'lower'."
The code is below and I have isolated the error happening in the last line shown (for brevity I have excluded the rest). I suspect it is because the target column contains 0s and 1s, which cannot be turned lowercase.
I have tried the answers to similar questions on Stack Overflow, but no luck so far.
How can you work around this?
question1                                          question2                                   is_paraphrase
How long was Einstein?                             How tall was Albert Einstein?               1
Does society place too much importance on sports?  How do sports contribute to the society?    0
What is a narcissistic personality disorder?       What is narcissistic personality disorder?  1
======
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
para = "paraphrases.tsv"
df = pd.read_csv(para, usecols = [3, 5], nrows = 100, header=0, sep="\t")
y = df["is_paraphrase"].values
X = df.drop("is_paraphrase", axis=1).values
X = X.astype(str) # I have tried this
X = np.char.lower(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 21, stratify = y)
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
The error is not caused by the last column. It occurs because your training set X_train contains two columns, question1 and question2, so each row of X_train is an array of values rather than a single string. When CountVectorizer tries to convert a row to lowercase, it raises the error, since a numpy.ndarray has no lower method.
To overcome this problem, split X_train into two parts, say X_train_pt1 and X_train_pt2. Then apply CountVectorizer to each part individually, followed by TfidfTransformer on each individual result. Also make sure you use the same object to transform these datasets. Finally, stack the two arrays together and give the result as input to your classifier. You can find a similar implementation here.
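The steps above can be sketched roughly as follows. This is only one possible reading of the advice, not a definitive implementation: the toy X_train and y_train stand in for the real split, the helper featurize and the name text_features are made up for illustration, and fitting a single CountVectorizer/TfidfTransformer pair on the text of both columns is an assumption about what "the same object" means.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real X_train / y_train (hypothetical data).
X_train = np.array([
    ["How long was Einstein?", "How tall was Albert Einstein?"],
    ["Does society place too much importance on sports?",
     "How do sports contribute to the society?"],
    ["What is a narcissistic personality disorder?",
     "What is narcissistic personality disorder?"],
])
y_train = np.array([1, 0, 1])

# Split the two text columns into separate 1-D arrays of strings.
X_train_pt1 = X_train[:, 0]
X_train_pt2 = X_train[:, 1]

# One shared feature extractor, fitted on the text of both columns
# (assumption: "the same object" means a shared vocabulary and IDF weights).
text_features = make_pipeline(CountVectorizer(), TfidfTransformer())
text_features.fit(np.concatenate([X_train_pt1, X_train_pt2]))


def featurize(X):
    """Transform each column separately, then stack the results side by side."""
    part1 = text_features.transform(X[:, 0])
    part2 = text_features.transform(X[:, 1])
    return hstack([part1, part2]).tocsr()


features_train = featurize(X_train)
clf = MultinomialNB().fit(features_train, y_train)

# The same fitted objects must be reused for X_test; here we just reuse X_train.
predictions = clf.predict(featurize(X_train))
```

Because each column is passed to the vectorizer as a 1-D array of strings, the 'lower' error no longer occurs. The stacked matrix has twice as many columns as the vocabulary has terms, one block per question.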
Update :
I think the following should be of some help (I admit this code can be further improved for more efficiency):