I have attempted to build a recommendation system using the Surprise library and the k-Nearest Neighbors (KNN) algorithm. The main problem I've run into is an extremely high Root Mean Square Error (RMSE): the score currently stands at 3179.9423, even though the ratings are normalized to a 0-10 scale before being passed to Surprise.
The data I'm working with is an imputed user-item matrix where the ratings are derived from customer interactions using the formula

$$IR_{iu} = 100 \cdot \text{Buy} + 50 \cdot \text{Favorite} + 15 \cdot \text{Interaction}$$

where $IR_{iu}$ is the imputed rating for user $u$ on item $i$. The interactions are weighted, with the highest score assigned to purchasing the item (Buy), a medium score for adding it to favorites (Favorite), and a lower score for any other interaction with the item (Interaction).
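To make the formula concrete, here is a minimal sketch of how the imputed ratings could be computed from raw interaction flags. The column names and the assumption that Buy / Favorite / Interaction are 0-or-1 flags are illustrative only, not my actual preprocessing code:

import pandas as pd

# Hypothetical interaction log: one row per (user, item) pair with 0/1 flags
interactions = pd.DataFrame({
    "UserID":   ["u1", "u1", "u2"],
    "item":     ["i1", "i2", "i1"],
    "buy":      [1, 0, 0],
    "favorite": [1, 1, 0],
    "interact": [1, 1, 1],
})

# IR_iu = 100*Buy + 50*Favorite + 15*Interaction
interactions["rating"] = (
    100 * interactions["buy"]
    + 50 * interactions["favorite"]
    + 15 * interactions["interact"]
)
# -> rating is 165 for u1/i1, 65 for u1/i2, 15 for u2/i1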
What I'm hoping for is a more effective approach to reduce the RMSE and improve the accuracy of my predictions, taking into account the particular nature of an imputed user-item matrix built from customer interactions. I am also open to alternative algorithms that might suit this problem better (see the cross-validation sketch after the code below). It's worth noting that my experience in this field is limited: this is my first attempt at building a recommendation system without guidance from experienced mentors, so I have been working largely by trial and error. Here is the code I am currently running:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from surprise import Dataset, Reader, accuracy

# Load the imputed user-item matrix; missing interactions become a rating of 0
df = pd.read_excel(r"D:\SELECT\CustomerRatings.xlsx")
df.replace(np.nan, 0, inplace=True)
# Separate the first column (user IDs) from the rest of the data
user_ids = df.iloc[:, 0]
data_without_user_ids = df.iloc[:, 1:]
# Initialize the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 10))
# Normalize each user's row of ratings to the 0-10 range
# (the double transpose makes MinMaxScaler scale per row instead of per column)
normalized_data = pd.DataFrame(
    scaler.fit_transform(data_without_user_ids.T).T,
    columns=data_without_user_ids.columns,
)
# Combine the user IDs and the normalized data into a new DataFrame
normalized_df = pd.concat([user_ids, normalized_data], axis=1)
# 'normalized_df' now contains the user IDs and the scaled data
# Reset the index and melt the DataFrame to long format
user_item_matrix = normalized_df.reset_index()
melted_data = pd.melt(user_item_matrix, id_vars=['UserID'], var_name='item', value_name='rating')
# Surprise expects the DataFrame columns in the order: user, item, rating
reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(melted_data, reader)
from surprise import KNNBasic
from surprise.model_selection import train_test_split
# Split the dataset into training and testing sets
trainset, testset = train_test_split(data, test_size=0.3)
sim_options = {
    "name": "cosine",
    "user_based": False,  # compute similarities between items (item-based CF)
}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)
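Since I'm open to other algorithms, I was also thinking of comparing the item-based KNN against one of Surprise's latent-factor models using cross-validation, roughly like the sketch below. The choice of SVD and the 5-fold setup are just assumptions on my part, and the sketch reuses the data object built above:

from surprise import SVD, KNNBasic
from surprise.model_selection import cross_validate

# Compare item-based KNN against a matrix-factorization model on the same data
candidates = [
    KNNBasic(sim_options={"name": "cosine", "user_based": False}),
    SVD(),
]
for algorithm in candidates:
    # 5-fold cross-validation, reporting RMSE for each fold
    cross_validate(algorithm, data, measures=["RMSE"], cv=5, verbose=True)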