I am building a simple recommender that suggests other users based on the similarity of their tweets. I used TF-IDF to vectorize all the text and was able to fit the data to a MultinomialNB, but I keep getting errors when I try to predict.

I've tried reshaping the data into an array, but I get the error "could not convert string to float". Can I even use this algorithm for this data? I tried different columns to see whether I would get a result, but I get the same error.

ValueError                                Traceback (most recent call last)
<ipython-input-39-a982bc4e1f49> in <module>
     20     nb_mul.fit(train_idf,y_train)
     21     user_knn = UserUser(10, min_sim = 0.4, aggregate='weighted-average')
---> 22     nb_mul.predict(y_test)
     23     #nb_mul.predict(np.array(test['Tweets'], test['Sentiment']))
     24     #TODO: find a way to predict with test data

~/anaconda2/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict(self, X)
     64             Predicted target values for X
     65         """
---> 66         jll = self._joint_log_likelihood(X)
     67         return self.classes_[np.argmax(jll, axis=1)]
     68 

~/anaconda2/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
    728         check_is_fitted(self, "classes_")
    729 
--> 730         X = check_array(X, accept_sparse='csr')
    731         return (safe_sparse_dot(X, self.feature_log_prob_.T) +
    732                 self.class_log_prior_)

~/anaconda2/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525             try:
    526                 warnings.simplefilter('error', ComplexWarning)
--> 527                 array = np.asarray(array, dtype=dtype, order=order)
    528             except ComplexWarning:
    529                 raise ValueError("Complex data not supported\n"

~/anaconda2/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: '["b\'RT @Avalanche: Only two cities have two teams in the second round of the playoffs...\\\\n\\\\nDenver and Boston!\\\\n\\\\n#MileHighBasketball #GoAvsGo http\\\\xe2\\\\x80\\\\xa6\'"]'

for train, test in xf.partition_users(final_test[['user','Tweets','Sentiment']],5, xf.SampleFrac(0.2)):
    x_train = []
    for index, row in train.iterrows():
        x_train.append(row['Tweets'])
    y_train = np.array(train['Sentiment'])
    y_test = np.array([test['user'],test['Tweets']])
    #print(y_train)
    tfidf = TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True,stop_words='english', lowercase=False)
    train_idf = tfidf.fit(x_train)
    train_idf = train_idf.transform(x_train)
    nb_mul = MultinomialNB()
    nb_mul.fit(train_idf,y_train)
    user_knn = UserUser(10, min_sim = 0.4, aggregate='weighted-average')
    nb_mul.predict(y_test)

The data looks like this:

   user                                             Tweets  \
0              2287418996  ["b'RT @HPbasketball: This stuff is 100% how K...   
1              2287418996  ["b'@KeuchelDBeard I may need to rewatch Begin...   
2              2287418996  ["b'@keithlaw Is that the stated reason for th...   
3              2287418996  ['b"@keithlaw @Yanks23242 I definitely don\'t ...   
4              2287418996  ["b'@Yanks23242 @keithlaw Sorry, please sub Jo...   
     Sentiment  Score  
0          neu  0.815  
1          neu  0.744  
2          neu  1.000  
3          neu  0.863  
4          neu  0.825 

Again, I want to pass in users with their tweets and sentiment, and have another user from the data recommended based on similarity.

1 Answer

AI_Learning On Best Solutions

You should not feed the raw tweets directly to the classifier. The model was fitted on TF-IDF vectors, so the test text must be transformed with the same fitted TfidfVectorizer before you call predict.

Make the following change:

nb_mul.predict(tfidf.transform(test['Tweets']))

Note that this model will only predict the sentiment of the tweets in the test data.
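To make the fit/transform/predict order concrete, here is a minimal sketch on toy data. The tweets and labels below are hypothetical stand-ins for the Tweets and Sentiment columns, and the vectorizer uses default settings rather than the ones in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the Tweets and Sentiment columns.
train_tweets = [
    "great win tonight love this team",
    "love the great defense tonight",
    "terrible loss awful defense",
    "awful game terrible night",
]
train_sentiment = ["pos", "pos", "neg", "neg"]
test_tweets = ["great team love it", "terrible awful loss"]

tfidf = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only.
train_idf = tfidf.fit_transform(train_tweets)

nb_mul = MultinomialNB()
nb_mul.fit(train_idf, train_sentiment)

# Reuse the fitted vectorizer so the test matrix has the same columns
# as the matrix the classifier was trained on.
test_idf = tfidf.transform(test_tweets)
predicted = nb_mul.predict(test_idf)
print(predicted)
```

The key point is that predict receives a numeric matrix with the same number of columns as the training matrix, not strings or user IDs.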

If your intention is recommendation, try a method designed for it: a classifier trained on the Sentiment labels cannot tell you which users are similar to each other.
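One possible approach, sketched here as an assumption rather than a definitive design, is content-based user similarity: join each user's tweets into one document, vectorize the documents with TF-IDF, and rank the other users by cosine similarity. The user IDs, tweets, and the recommend helper below are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data: user id -> list of that user's tweets.
tweets_by_user = {
    "u1": ["nuggets playoffs basketball tonight", "basketball game denver"],
    "u2": ["denver basketball playoffs win", "nuggets game tonight"],
    "u3": ["baking sourdough bread recipe", "bread flour and yeast"],
}

users = list(tweets_by_user)
# One document per user: all of that user's tweets joined together.
docs = [" ".join(tweets_by_user[u]) for u in users]

user_vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(user_vectors)  # dense (n_users, n_users) matrix
np.fill_diagonal(sim, -1.0)  # exclude each user from their own ranking


def recommend(user, k=1):
    """Return the k users whose tweets are most similar to `user`'s."""
    row = sim[users.index(user)]
    top = np.argsort(row)[::-1][:k]
    return [users[i] for i in top]


print(recommend("u1"))
```

With a DataFrame like the one in the question, the per-user documents could be built by grouping on the user column and joining the Tweets values. Sentiment could be added as an extra feature, but that is left out of this sketch.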