I am using Numba to speed up the loop below. Without Numba it takes 135 s to execute, and with Numba it takes 0.30 s :) which is very fast.
In the loop I compare each entry of the array against a threshold of 0.85. If the condition is True, I insert the data into a List, which the function returns.
The data inserted into the List looks like this:
['Source ID', 'Source TEXT', 'Similar ID', 'Similar TEXT', 'Score']
Column = 'TEXT'
df = preprocessing(dataresult, Column)  # remove special characters from the 'TEXT' column
idd = df['ID'].to_numpy()
txt = df['TEXT'].to_numpy()
message_embeddings = model_url(np.array(df['DescriptionNew']))  # pass the text to the Universal Sentence Encoder model to create sentence embeddings
cos_sim = cosine_similarity(message_embeddings)  # len(cos_sim) > 8000
import numba
from numba.typed import List

# The function below finds duplicates among rows.
@numba.jit(nopython=True)
def similarity(nid, txxt, cos_sim, threshold):
    numba_list = List()
    for i in range(cos_sim.shape[0]):
        for index in range(i, cos_sim.shape[1]):
            if (cos_sim[i][index] > threshold) & (i != index):
                numba_list.append([nid[i], nid[index], cos_sim[i][index]])  # either this works
                # numba_list.append([txxt[i], txxt[index]])  # or this works
                # numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]])  # I want this to work.
    return numba_list

print(similarity(idd, txt, cos_sim, 0.85))
In the code above, the append call accepts either the numeric columns or the text columns, but not both together. I want all the columns, both numbers and text, to be inserted into numba_list.
I am getting the error below:
1 frames
/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
359 raise e
360 else:
--> 361 raise e.with_traceback(None)
362
363 argtypes = []
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Poison type used in arguments; got Poison<LiteralList((int64, [unichr x 12], int64, [unichr x 12], float32))>
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[undefined])
During: typing of call at <ipython-input-179-6ee851edb6b1> (14)
File "<ipython-input-179-6ee851edb6b1>", line 14:
def zero(nid, txxt, cos_sim, threshold):
<source elided>
# print(i+1)
numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]])
^
The problem you are facing comes from a typing issue: Numba cannot infer the type of the list. The root cause is that each appended item is itself a list mixing different types (integers, strings and a float), which is AFAIK not supported by Numba yet and would not be efficient anyway. Tuples, however, are made for exactly that, so you can try appending a tuple instead of a list (untested).
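A minimal sketch of the tuple idea, under a few assumptions. The compiled function only collects (i, index, score) tuples, so every item has the same type, and the ID and text columns are attached afterwards in plain Python. Whether Numba can type the strings directly inside the tuple is not checked here. The names similar_pairs and attach_columns are made up for illustration, the IDs are assumed to be integers, and the try/except fallback only exists so the sketch also runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without Numba.
    def njit(func):
        return func


@njit
def similar_pairs(cos_sim, threshold):
    # Every appended item is a tuple (int, int, float), so the list has a
    # single item type that can be inferred.
    pairs = []
    for i in range(cos_sim.shape[0]):
        # Starting at i + 1 skips the diagonal, replacing the i != index test.
        for index in range(i + 1, cos_sim.shape[1]):
            score = cos_sim[i, index]
            if score > threshold:
                pairs.append((i, index, score))
    return pairs


def attach_columns(pairs, ids, texts):
    # Plain Python: build (source ID, source text, similar ID, similar text, score).
    # Assumes integer IDs.
    return [(int(ids[i]), str(texts[i]), int(ids[j]), str(texts[j]), float(s))
            for i, j, s in pairs]
```

With the arrays from the question, this would be called as attach_columns(similar_pairs(cos_sim, 0.85), idd, txt).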
If the condition is often true, you can use pre-allocated NumPy arrays with direct indexing rather than slow list append calls to strongly speed up the computation. However, the return type will be different with this solution: the idea is to return a tuple of 3 arrays rather than a list of tuples with 3 items each. This solution also benefits from taking significantly less memory.
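A minimal sketch of the pre-allocation idea, assuming integer IDs. The name similar_pairs_arrays is made up for illustration, the buffers are sized to the largest possible number of pairs above the diagonal, and the scores are stored as float64 for simplicity. The try/except fallback only exists so the sketch also runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without Numba.
    def njit(func):
        return func


@njit
def similar_pairs_arrays(nid, cos_sim, threshold):
    n, m = cos_sim.shape

    # Upper bound: the number of (i, index) pairs with index > i.
    max_pairs = 0
    for i in range(n):
        max_pairs += max(m - i - 1, 0)

    # Pre-allocated output buffers, written by direct indexing.
    src_ids = np.empty(max_pairs, dtype=np.int64)
    sim_ids = np.empty(max_pairs, dtype=np.int64)
    scores = np.empty(max_pairs, dtype=np.float64)

    count = 0
    for i in range(n):
        for index in range(i + 1, m):
            score = cos_sim[i, index]
            if score > threshold:
                src_ids[count] = nid[i]
                sim_ids[count] = nid[index]
                scores[count] = score
                count += 1

    # Copy the filled prefix so the oversized buffers can be released.
    return src_ids[:count].copy(), sim_ids[:count].copy(), scores[:count].copy()
```

With the arrays from the question, src_ids, sim_ids, scores = similar_pairs_arrays(idd, cos_sim, 0.85) would give three arrays of equal length, one entry per matching pair.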