NumPy vs. nested dictionaries: which is more efficient in terms of runtime and memory?


I am new to NumPy. I have referred to the following SO question: Why NumPy instead of Python lists?

The final comment on that question seems to indicate that NumPy is probably slower on a particular dataset.

I am working on a 1650*1650*1650 data set. These are essentially similarity values for each movie in the MovieLens data set along with the movie id.

My options are to use either a 3D NumPy array or a nested dictionary. On a reduced data set of 100*100*100, the run times were not too different.
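For the memory side of the comparison, here is a minimal sketch, assuming the similarity values are stored as 64-bit floats and that CPython is the interpreter. The helper names `dense_array_bytes` and `nested_dict_bytes` are hypothetical, and the small size `n = 30` is only for illustration. The dictionary figure is a rough lower bound, since it ignores the key objects.

```python
import sys

import numpy as np


def dense_array_bytes(n, dtype=np.float64):
    """Bytes needed by the data buffer of an n*n*n array (nothing is allocated)."""
    return n ** 3 * np.dtype(dtype).itemsize


def nested_dict_bytes(n):
    """Rough size of an n*n*n nested dict of floats, built and measured directly."""
    nested = {i: {j: {k: float(i + j + k) for k in range(n)}
                  for j in range(n)}
              for i in range(n)}
    # sys.getsizeof reports only the container itself, so walk every level.
    total = sys.getsizeof(nested)
    for middle in nested.values():
        total += sys.getsizeof(middle)
        for inner in middle.values():
            total += sys.getsizeof(inner)
            # Count each stored float object; the int keys are ignored.
            total += sum(sys.getsizeof(value) for value in inner.values())
    return total


n = 30  # small illustrative size, not the real data set
array_bytes = dense_array_bytes(n)
dict_bytes = nested_dict_bytes(n)
full_size_bytes = dense_array_bytes(1650)  # the 1650*1650*1650 case

print(f"array, n={n}: {array_bytes} bytes")
print(f"nested dict, n={n}: about {dict_bytes} bytes")
print(f"array, n=1650: {full_size_bytes / 1e9:.1f} GB")
```

Under these assumptions a dense float64 array of 1650*1650*1650 elements needs about 36 GB for its data alone, so either representation is likely to need a smaller dtype, a sparse layout, or on-disk storage at full size.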

Please find the IPython code snippet below:

for id1 in range(1, count + 1):
    data1 = df[df.movie_id == id1].set_index('user_id')[cols]
    sim_score = {}
    for id2 in range(1, count + 1):
        if id1 != id2:
            data2 = df[df.movie_id == id2].set_index('user_id')[cols]
            sim = calculatePearsonCorrUnified(data1, data2)
        else:
            sim = 1
        sim_matrix_panel[id1]['Sim'][id2] = sim



from math import sqrt

import numpy as np


def calculatePearsonCorrUnified(df1, df2):
    sim_score = 0

    # Collect the index labels (user or movie ids) present in both frames
    common_movies_or_users = []
    for temp_id in df1.index:
        if temp_id in df2.index:
            common_movies_or_users.append(temp_id)

    n = len(common_movies_or_users)
    if n == 0:
        return sim_score

    # Ratings corresponding to user_1 / movie_1, present in the common list
    rating1 = df1.loc[df1.index.isin(common_movies_or_users)]['rating'].values
    # Ratings corresponding to user_2 / movie_2, present in the common list
    rating2 = df2.loc[df2.index.isin(common_movies_or_users)]['rating'].values

    sum1 = sum(rating1)
    sum2 = sum(rating2)

    # Sum up the squares
    sum1Sq = sum(np.square(rating1))
    sum2Sq = sum(np.square(rating2))

    # Sum up the products
    pSum = sum(np.multiply(rating1, rating2))

    # Calculate Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt(float(sum1Sq - pow(sum1, 2) / n) * float(sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    sim_score = num / den

    return sim_score
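As a sanity check on the sum-based Pearson formula used above, here is a minimal sketch that applies the same arithmetic to two already-aligned rating arrays and compares the result with `np.corrcoef`. The function name `pearson_from_sums` and the sample ratings are made up for illustration. It assumes the two inputs have equal length and are in matching order.

```python
from math import sqrt

import numpy as np


def pearson_from_sums(x, y):
    """Pearson correlation via the sum-of-products formula, using array methods."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if n == 0:
        return 0.0
    num = (x * y).sum() - x.sum() * y.sum() / n
    den_sq = ((x ** 2).sum() - x.sum() ** 2 / n) * ((y ** 2).sum() - y.sum() ** 2 / n)
    # A constant input gives zero variance; treat that as "no similarity".
    if den_sq <= 0:
        return 0.0
    return num / sqrt(den_sq)


# Made-up ratings for two movies from the same five users, in the same order.
ratings_a = [5.0, 3.0, 4.0, 4.0, 2.0]
ratings_b = [4.0, 2.0, 5.0, 3.0, 1.0]

score = pearson_from_sums(ratings_a, ratings_b)
reference = np.corrcoef(ratings_a, ratings_b)[0, 1]
print(score, reference)
```

Using the array methods `.sum()` instead of the built-in `sum` keeps the reductions inside NumPy, which tends to matter once the rating vectors get long.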

What would be the most precise way to time the runtime of each of these options?
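One common approach is the standard-library `timeit` module, which the IPython `%timeit` magic also builds on. The following is a minimal sketch, assuming a small random 20*20*20 cube in place of the real similarity data. The helper `best_time` and the three `sum_*` functions are hypothetical stand-ins for whatever access pattern the real code uses.

```python
import timeit

import numpy as np

n = 20  # small illustrative size
rng = np.random.default_rng(0)
array3d = rng.random((n, n, n))
# Same values stored as a nested dictionary keyed by the three indices.
nested = {i: {j: {k: float(array3d[i, j, k]) for k in range(n)}
              for j in range(n)}
          for i in range(n)}


def sum_array_elementwise():
    """Visit every array element with Python-level indexing."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                total += array3d[i, j, k]
    return total


def sum_nested_dict():
    """Visit every value of the nested dictionary by key lookup."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                total += nested[i][j][k]
    return total


def sum_array_vectorized():
    """Let NumPy do the whole reduction in one call."""
    return float(array3d.sum())


def best_time(func, repeat=5, number=3):
    """Best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


timings = {func.__name__: best_time(func)
           for func in (sum_array_elementwise, sum_nested_dict, sum_array_vectorized)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over repeats is the usual convention, since slower runs mostly reflect interference from other processes. The element-by-element and vectorized variants are timed separately because NumPy's advantage typically shows up only when the loop itself moves into NumPy.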

Any pointers would be greatly appreciated.
