I am new to NumPy. I have referred to the following SO question: Why NumPy instead of Python lists?
The final comment on that question suggests that NumPy may be slower than plain Python on certain data sets.
I am working on a 1650*1650*1650 data set. These are essentially similarity values for each movie in the MovieLens data set along with the movie id.
My options are to use either a 3D NumPy array or a nested dictionary. On a reduced data set of 100*100*100, the run times of the two options were not very different.
Please find the IPython code snippet below:
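For scale, here is a rough estimate of what a dense 3D array of the full size would occupy. This is only a sketch: it assumes every cell is stored as a `float64` (8 bytes), which is my assumption and not something I have settled on.

```python
import numpy as np

n = 1650  # size of each dimension of the full data set

# Assumption: every cell is a float64; a smaller dtype would shrink this.
bytes_per_cell = np.dtype(np.float64).itemsize
bytes_needed = n ** 3 * bytes_per_cell

print(f"{bytes_needed / 1e9:.1f} GB")  # prints "35.9 GB"
```

A nested dictionary holding the same number of entries would carry per-object overhead on top of the values themselves, so I would not expect it to be smaller.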
for id1 in range(1, count + 1):
    data1 = df[df.movie_id == id1].set_index('user_id')[cols]
    sim_score = {}
    for id2 in range(1, count + 1):
        if id1 != id2:
            data2 = df[df.movie_id == id2].set_index('user_id')[cols]
            sim = calculatePearsonCorrUnified(data1, data2)
        else:
            sim = 1
        sim_matrix_panel[id1]['Sim'][id2] = sim
import pdb
from math import sqrt

import numpy as np  # needed for np.square and np.multiply below


def calculatePearsonCorrUnified(df1, df2):
    sim_score = 0
    # Collect the ids (users or movies) that appear in both frames
    common_movies_or_users = []
    for temp_id in df1.index:
        if temp_id in df2.index:
            common_movies_or_users.append(temp_id)
    #pdb.set_trace()
    n = len(common_movies_or_users)
    #print ('No. of common movies: ' + str(n))
    if n == 0:
        return sim_score
    # Ratings corresponding to user_1 / movie_1, present in the common list
    rating1 = df1.loc[df1.index.isin(common_movies_or_users)]['rating'].values
    # Ratings corresponding to user_2 / movie_2, present in the common list
    rating2 = df2.loc[df2.index.isin(common_movies_or_users)]['rating'].values
    sum1 = sum(rating1)
    sum2 = sum(rating2)
    # Sum up the squares
    sum1Sq = sum(np.square(rating1))
    sum2Sq = sum(np.square(rating2))
    # Sum up the products
    pSum = sum(np.multiply(rating1, rating2))
    # Calculate Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt(float(sum1Sq - pow(sum1, 2) / n) * float(sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    sim_score = num / den
    return sim_score
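A possible shorter variant of the function above, as a minimal sketch only. It assumes each frame has a unique index and a `rating` column; the name `pearson_common` is hypothetical. It finds the shared ids with `Index.intersection` instead of a Python loop, and selects both rating vectors with the same label list, so each pair of ratings stays aligned even if the two frames list the ids in different orders.

```python
import numpy as np
import pandas as pd


def pearson_common(df1, df2):
    """Pearson correlation of 'rating' over the ids shared by both frames.

    Hypothetical helper; assumes each frame has a unique index.
    """
    common = df1.index.intersection(df2.index)
    n = len(common)
    if n == 0:
        return 0.0
    # Selecting with the same labels keeps the two vectors pairwise aligned.
    r1 = df1.loc[common, 'rating'].to_numpy(dtype=float)
    r2 = df2.loc[common, 'rating'].to_numpy(dtype=float)
    num = (r1 * r2).sum() - r1.sum() * r2.sum() / n
    den = np.sqrt((np.square(r1).sum() - r1.sum() ** 2 / n)
                  * (np.square(r2).sum() - r2.sum() ** 2 / n))
    if den == 0:
        return 0.0
    return float(num / den)


# Tiny made-up example: the two frames share users 2, 3 and 4.
a = pd.DataFrame({'rating': [5, 1, 2, 3]}, index=[1, 2, 3, 4])
b = pd.DataFrame({'rating': [6, 4, 2, 9]}, index=[4, 3, 2, 7])
print(pearson_common(a, b))
```

The formula is the same one used in `calculatePearsonCorrUnified`; only the way the common ids and ratings are gathered differs.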
What would be the most precise way to time the run of each of these options?
Any pointers would be greatly appreciated.
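To make the question concrete, this is the kind of measurement I have in mind, sketched with the standard library's `timeit` module. The two fill functions and the size `n = 20` are hypothetical stand-ins for my real code; they only write a constant into every cell.

```python
import timeit

import numpy as np


def fill_array(n):
    """Fill an n*n*n NumPy array one element at a time."""
    arr = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                arr[i, j, k] = 1.0
    return arr


def fill_dict(n):
    """Fill a three-level nested dictionary with the same values."""
    d = {}
    for i in range(n):
        d[i] = {}
        for j in range(n):
            d[i][j] = {}
            for k in range(n):
                d[i][j][k] = 1.0
    return d


def best_time(func, n, repeat=5):
    """Return the fastest of `repeat` single runs, in seconds."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))


n = 20  # hypothetical reduced size, kept small so the sketch runs quickly
array_seconds = best_time(fill_array, n)
dict_seconds = best_time(fill_dict, n)
print(f"array: {array_seconds:.6f}s, dict: {dict_seconds:.6f}s")
```

Taking the minimum over several repeats follows the advice in the `timeit` documentation, since the slower runs mostly reflect interference from other processes. In IPython, `%timeit fill_array(20)` should give a comparable measurement with less code.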