I have created a recommender system. There are 2 dataframes – input_df and recommended_df
input_df – Dataframe of content already viewed by users. This df is used for generating the recommendations
User_Name Viewed_Content_Name
User1 Content1
User1 Content2
User1 Content5
User2 Content1
User2 Content3
User2 Content5
User2 Content6
User2 Content8
recommended_df – Dataframe of content recommended to users
User_Name Recommended_Content_Name
User1 Content1 # This recommendation has already been viewed by User1. Hence this recommendation should be removed
User1 Content8
User2 Content2
User2 Content7
I want to remove recommendations that the user has already viewed. I have tried the following two approaches, but both are very time-consuming. I need an approach that identifies rows of recommended_df that also occur in input_df.
Approach 1 - Using subsetting, for each row in recommended_df, I try to see if that row has already occurred in input_df
for i in range(len(recommended_df)):
    # Count the rows in input_df that match this user/content pair
    recommended_df.loc[i, 'Recommendation_Completed'] = len(input_df[(input_df['User_Name'] == recommended_df.loc[i, 'User_Name']) & (input_df['Viewed_Content_Name'] == recommended_df.loc[i, 'Recommended_Content_Name'])])
# Remove the row if it has already occurred in input_df
recommended_df = recommended_df.loc[recommended_df['Recommendation_Completed'] == 0]
Approach 2 - Try to see if the row in recommended_df occurs in input_df using apply
I created a key column in input_df and recommended_df. This is a unique key for each user and content pair.
input_df =
User_Name Viewed_Content_Name keycol (User_Name + Viewed_Content_Name)
User1 Content1 User1Content1
User1 Content2 User1Content2
User1 Content5 User1Content5
User2 Content1 User2Content1
User2 Content3 User2Content3
User2 Content5 User2Content5
User2 Content6 User2Content6
User2 Content8 User2Content8
recommended_df =
User_Name Recommended_Content_Name keycol (User_Name + Recommended_Content_Name)
User1 Content1 User1Content1
User1 Content8 User1Content8
User2 Content2 User2Content2
User2 Content7 User2Content7
# Flag each recommendation whose key already occurs in input_df
recommended_df['Recommendation_Completed'] = recommended_df['keycol'].apply(lambda d: d in input_df['keycol'].values)
# Remove the row if it occurs in input_df
recommended_df = recommended_df.loc[recommended_df['Recommendation_Completed'] == False]
The second approach, using apply, is faster than approach 1, but I can still do the same thing faster in Excel with the COUNTIFS function. How can I do this faster in Python?
Try to only use apply as a last resort. You can concatenate user and content and then use boolean selection.
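A minimal sketch of that idea, using the sample data from the question. The separator `SEP` and the names `viewed_keys`, `recommended_keys`, `already_viewed` and `filtered_df` are my own choices for illustration, not anything from the original code. I am assuming the separator character never appears in a user or content name; without one, pairs such as "User1" + "1Content" and "User11" + "Content" would produce the same key.

```python
import pandas as pd

# Sample data copied from the question
input_df = pd.DataFrame({
    "User_Name": ["User1", "User1", "User1",
                  "User2", "User2", "User2", "User2", "User2"],
    "Viewed_Content_Name": ["Content1", "Content2", "Content5",
                            "Content1", "Content3", "Content5",
                            "Content6", "Content8"],
})
recommended_df = pd.DataFrame({
    "User_Name": ["User1", "User1", "User2", "User2"],
    "Recommended_Content_Name": ["Content1", "Content8",
                                 "Content2", "Content7"],
})

# Assumption: "|" does not occur in any user or content name
SEP = "|"

# Vectorized string concatenation builds one key per row, no apply needed
viewed_keys = input_df["User_Name"] + SEP + input_df["Viewed_Content_Name"]
recommended_keys = (
    recommended_df["User_Name"] + SEP + recommended_df["Recommended_Content_Name"]
)

# Boolean mask: True where the recommendation was already viewed by that user
already_viewed = recommended_keys.isin(viewed_keys)

# Boolean selection keeps only the recommendations not yet viewed
filtered_df = recommended_df[~already_viewed].reset_index(drop=True)
print(filtered_df)
```

On the sample data this should drop only the User1/Content1 row. `isin` checks all keys in one vectorized call instead of scanning `input_df['keycol'].values` once per row, which is where I would expect most of the speed-up over the apply version to come from.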