I am trying to find the similarity between each pair of items. The items are in a Python dictionary, and I compute the similarity one pair at a time. The code is -
def allSimilarity(itemsDict, similarityMetric):
    itemList = itemsDict.keys()
    itemSimilarityDict = {}
    for item1 in itemList:
        itemSimilarityDict[item1] = {}
        for item2 in itemList:
            if item1 == item2:
                continue
            itemSimilarityDict[item1][item2] = similarityMetric(itemsDict, item1, item2)
    return itemSimilarityDict
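To show how this is called, here is a toy example (the small dictionary and the jaccard stand-in below are purely illustrative placeholders, not my real data or metric):

# Purely illustrative data and metric, just to show the call shape.
items = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {9}}

def jaccard(d, x, y):
    # Hypothetical stand-in metric: intersection over union of two sets.
    return len(d[x] & d[y]) / float(len(d[x] | d[y]))

sims = allSimilarity(items, jaccard)
# sims["a"]["b"] == 0.5, sims["a"]["c"] == 0.0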
The problem is that the outer loop takes about 5 seconds per item. I have ~300,000 items, so the whole computation would take ~18 days. Is there any way to speed this up? Can I use packages like Theano or TensorFlow and run it on a GPU? Or could I spin up machines in the cloud and parallelize the process?
I don't think a machine learning library would be particularly helpful here, since there are no operations or building blocks readily available in them for this type of all-to-all similarity comparison.
I think you'd have better luck looking at more generic parallelization solutions: OpenMP, TBB, MapReduce, AVX, CUDA, MPI, etc.
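Even without leaving Python, the same idea applies: the outer loop is embarrassingly parallel, so it can be split across CPU cores with the standard multiprocessing module. A rough sketch (the names allSimilarityParallel, _row, ITEMS, and metric, the process count, and the chunksize are my own choices; it assumes a fork-based start method so workers inherit the globals, and a module-level worker function):

from multiprocessing import Pool

def _row(item1):
    # Similarities of item1 against every other item.
    # ITEMS and metric are module globals inherited by forked workers;
    # with a spawn start method (e.g. Windows) you would need to pass them explicitly.
    return item1, {item2: metric(ITEMS, item1, item2)
                   for item2 in ITEMS if item2 != item1}

def allSimilarityParallel(itemsDict, similarityMetric, processes=8):
    global ITEMS, metric
    ITEMS, metric = itemsDict, similarityMetric  # set before the pool forks below
    with Pool(processes) as pool:
        # chunksize keeps inter-process overhead manageable for ~300,000 small tasks
        return dict(pool.map(_row, ITEMS, chunksize=1000))

Even with perfect scaling this only divides the ~18 days by the number of cores, so it is best combined with a cheaper metric or a compiled inner loop.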
Also, rewriting the same code in C++ will surely speed things up.