I need to find users with similar interests based on the URL favorites of a large number of users. My data therefore contains only "like" signals, with no "dislike" or "ignore". Because the number of URLs is practically unlimited, I also cannot assume that every URL a user has not liked is a "dislike" or "ignore". In this case, how should I convert the raw data into a Surprise Dataset? Or is this kind of data simply unusable by collaborative filtering algorithms such as KNN?
Source data (favorite items per user):
s_data = [
[
"user1",
[
"item1",
"item2",
"item3",
"item4",
"item5",
"item6"
]
],
[
"user2",
[
"item3",
"item4",
"item5",
"item6"
]
],
[
"user3",
[
"item1",
"item2",
"item3",
"item6"
]
],
[
"user4",
[
"item4",
"item5",
"item6",
"item7",
"item8",
"item9"
]
]
]
Because the original data records only one kind of event (the user "likes" the item), I assume that the user gave a rating of 1 to each item they liked. Python code:
import pandas as pd
from surprise import Dataset, KNNBasic, Reader
# prepare the data: one (user, item, rating=1) row per favorite
df_pre = [[z[0], zz, 1] for z in s_data if z[1] is not None for zz in z[1]]
df = pd.DataFrame(df_pre)
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(df, reader)
trainset = data.build_full_trainset()
# training
sim_options = {'name': 'pearson', 'user_based': True}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)
# compute the similarity between user1 and every other user
inner_id = algo.trainset.to_inner_uid(ruid='user1')
all_instances = algo.trainset.all_users
rs = [(x, algo.sim[inner_id][x]) for x in all_instances() if x != inner_id]
sorted_rs = sorted(rs, key=lambda x: x[1], reverse=True)
print(sorted_rs)
result: [(1, 0.0), (2, 0.0), (3, 0.0)]
[Image: the similarity between each pair of users]
[Image: raw data in tabular form]
As shown above, the program reports a similarity of 0 between every pair of users. If I switch to cosine or msd, the result is the same. If I switch to pearson_baseline, it raises "ZeroDivisionError: float division".

I want to know how to use KNN to find the users whose behavior is most similar to a given user's, with data like the above. Thanks a lot.


You need to include information about items that users do not like so that you have both 0s and 1s in your dataset. The data should look like this (just screenshotting the top part here):
I got this dataframe with this code:
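The snippet referred to here is not shown, so the following is only a minimal sketch of one way to build such a table. It assumes that every user/item pair missing from the favorites gets a rating of 0, and the names `users`, `items`, `liked` and `rows` are illustrative, not taken from the original code.

```python
from itertools import product

# Same favorites as in the question, written compactly.
s_data = [
    ["user1", ["item1", "item2", "item3", "item4", "item5", "item6"]],
    ["user2", ["item3", "item4", "item5", "item6"]],
    ["user3", ["item1", "item2", "item3", "item6"]],
    ["user4", ["item4", "item5", "item6", "item7", "item8", "item9"]],
]

users = [user for user, _ in s_data]
# Every item that appears in at least one user's favorites.
items = sorted({item for _, favs in s_data for item in favs})
# Set of (user, item) pairs that were actually liked.
liked = {(user, item) for user, favs in s_data for item in favs}

# One row per user/item pair: 1 if liked, otherwise 0 (an assumption).
rows = [(u, i, 1 if (u, i) in liked else 0) for u, i in product(users, items)]
```

Passing `rows` to `pd.DataFrame(rows, columns=["user", "item", "rating"])` should give a dataframe that can replace `df` in the question's code. Note that this treats every item a user has not favorited as a 0, which is exactly the assumption the question is wary of, and the table grows as users × items.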
Now running your code with the new df:
Gives:
Which I believe is more like what you expected to see.
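To illustrate why the 0s matter, here is a small sketch of a plain Pearson correlation written by hand. It is not Surprise's own implementation. Returning 0.0 when the denominator is zero is an assumption chosen to mirror the 0.0 values in the question's output. With likes only, the ratings two users share are all 1, so there is no variance to correlate; with 0s added, the vectors differ and the correlation becomes informative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors.

    Returns 0.0 when either vector has zero variance (assumption:
    this mirrors the 0.0 similarities seen in the question).
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    denom = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return 0.0 if denom == 0 else (n * sxy - sx * sy) / denom

# Likes only: user1 and user2 share item3..item6, all rated 1.
likes_only = pearson([1, 1, 1, 1], [1, 1, 1, 1])

# With 0s added, over item1..item9 in order.
user1 = [1, 1, 1, 1, 1, 1, 0, 0, 0]
user2 = [0, 0, 1, 1, 1, 1, 0, 0, 0]
user4 = [0, 0, 0, 1, 1, 1, 1, 1, 1]
sim_12 = pearson(user1, user2)
sim_14 = pearson(user1, user4)

print(likes_only, sim_12, sim_14)
```

Here `likes_only` has a zero denominator, while `sim_12` is positive and `sim_14` is negative, so the users can now be ranked by similarity.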