I have the following code for similarity scoring:
from rapidfuzz import process, fuzz
import pandas as pd
import numpy as np
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")
I want to identify similar names in the name column if those names belong to one cluster number, and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2 and three entities from the name column belong to this cluster: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.
The code generates the expected result:
name cluster_number id
0 South Beach 1 1
1 Dog 2 2
2 Bird 3 3
3 Ant 3 4
4 Big Dog 2 2
5 Beach 1 1
6 Dear 4 5
7 Cat 2 6
The code above is an efficient, vectorized method for similarity scoring. It works perfectly for small data sets, but when I try a dataframe with 1 million rows I get a MemoryError from rapidfuzz.process.cdist(...). As mentioned in the comment section below, this function returns a matrix of size len(queries) x len(choices) x size(dtype). By default this dtype is float or int32_t depending on the scorer (for the default scorer you are using it is float). So for 1 million names, the result matrix would require about 4 terabytes of memory. My PC has 12GB of free RAM, which is nowhere near enough. Any ideas how to avoid overloading RAM while keeping the computation vectorized?
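For scale, that estimate can be reproduced in a couple of lines (a rough sketch; the float32 itemsize is an assumption based on the comment quoted above):

import numpy as np

n_names = 1_000_000                          # number of names in the real dataframe
itemsize = np.dtype(np.float32).itemsize     # 4 bytes, assuming a float32 result matrix
matrix_bytes = n_names * n_names * itemsize  # len(queries) x len(choices) x size(dtype)
print(f"~{matrix_bytes / 1e12:.1f} TB")      # roughly 4 TB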
Following @J.M.Arnold's solution, including his comment, the code may be rewritten as:
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)
names = df_test["name"]
def calculate_similarity_matrix(names):
    scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
    return scores
chunks = np.array_split(names, 1000)
_ = []
for i, chunk in enumerate(chunks):
    matrix = calculate_similarity_matrix(chunk)
    _.append(matrix)
finished = pd.concat(_)
x, y = np.where(finished > 50)
groups = (pd.DataFrame(finished.index[x], finished.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")
But it will not generate correct results:
name cluster_number id
0 Beach 1 2
1 South Beach 1 8
2 Big Dog 2 3
3 Cat 2 5
4 Dog 2 7
5 Ant 3 1
6 Bird 3 4
7 Dear 4 6
Note that, for example, Dog and Big Dog have different ids, but they should share the same one.
As maxbachmann said in your GitHub issue, it's all about the default dtype: if you take a look at the documentation of rapidfuzz.process.cdist you can see how the data-type is chosen for each scorer.
You can calculate the size of the matrix via len(queries) x len(choices) x size(dtype), which for your current implementation is 1 million x 1 million x 8 bytes (for float, which is the default for the scorer you are using). That is about 7.6TB! Even for int32 with 4 bytes (as Max Bachmann mentioned) you end up with 3.8 TB of required space.
One option to dodge your issue is to decrease the size of the dtype, e.g. using int8 with 1 byte. Obviously, you will have significantly less accurate similarity scores, as the value range is from -128 to 127! With the above-mentioned formula you would be able to decrease the size down to ~950GB.
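For illustration, such a call could look roughly like the sketch below (np.uint8 is used here in the same spirit as the int8 suggestion, since fuzz.ratio scores lie in 0-100; which dtypes cdist accepts should be checked against the documentation of your rapidfuzz version):

import numpy as np
from rapidfuzz import process, fuzz

names = ["South Beach", "Beach", "Dog", "Big Dog"]   # stand-in for the real name column

# Request a 1-byte result matrix instead of the default; the fractional part
# of the scores is lost, but the memory footprint shrinks accordingly.
scores = process.cdist(names, names, scorer=fuzz.ratio, dtype=np.uint8, workers=-1)
print(scores.dtype, scores.nbytes)   # uint8, 16 bytes for this toy input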
Another approach (and probably the only one viable in the long run) is to split up the data and process it in smaller chunks, as Max Bachmann suggested. Afterwards, you will have the full calculated similarity matrix in finished! This approach allows you to process larger data sets without running out of memory (as your question asked), because the matrices are stored on your disk between iterations.
However, my approach will definitely be slower (in comparison to processing all the data at once, which is not possible unless you have 3TB+ of RAM), as you will need to read and write to your disk 1,000 times.
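A minimal sketch of that disk-backed loop might look like this (the parquet files and their names are my own illustration rather than the answer's original code, and require pyarrow or fastparquet; as in the chunked code above, each chunk is only compared against itself):

import numpy as np
import pandas as pd
from rapidfuzz import process

names = df_test["name"]                  # df_test as constructed above
chunks = np.array_split(names, 1000)

# Compute each chunk's similarity matrix, write it to disk, and let it be
# garbage-collected instead of keeping all matrices in RAM.
for i, chunk in enumerate(chunks):
    matrix = pd.DataFrame(process.cdist(chunk, chunk, workers=-1),
                          columns=chunk, index=chunk)
    matrix.to_parquet(f"similarity_chunk_{i}.parquet")

# Read the pieces back from disk and concatenate them into `finished`.
finished = pd.concat(pd.read_parquet(f"similarity_chunk_{i}.parquet")
                     for i in range(len(chunks)))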
Obviously, you can play around with the number of chunks you're using. In my current approach you have 1,000 chunks with 1,000 names each, so each step (with float as 8 bytes) will only require 8MB of RAM as per the formula above. You can experiment and find the optimum for your hardware!
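To get a feel for that trade-off, the per-chunk cost can be tabulated with the same formula (a quick sketch, keeping the 8-bytes-per-score assumption from above):

itemsize = 8   # bytes per score, matching the float assumption above
for chunk_size in (500, 1_000, 5_000, 10_000):
    mb = chunk_size * chunk_size * itemsize / 1e6
    print(f"{chunk_size:>6} names per chunk -> {mb:,.0f} MB per chunk matrix")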