How to do efficient matrix computation without memory overload for similarity scoring?


I have the following code for similarity scoring:

from rapidfuzz import process, fuzz
import numpy as np
import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
           .groupby(level=0)
           .agg(frozenset)
           .drop_duplicates()
           .reset_index(drop=True)
           .reset_index()
           .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")

I want to identify similar names in the name column if those names belong to the same cluster number and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2 and three entities from the name column belong to it: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.

The code generates the expected result:

    name        cluster_number id
0   South Beach 1              1
1   Dog         2              2
2   Bird        3              3
3   Ant         3              4
4   Big Dog     2              2
5   Beach       1              1
6   Dear        4              5
7   Cat         2              6

The code above is an efficient, vectorized method for similarity scoring. It works perfectly for small data sets, but when I try a dataframe with 1 million rows I get a MemoryError from rapidfuzz.process.cdist(...). As mentioned in the comment section below, this function returns a matrix of size len(queries) x len(choices) x size(dtype). By default this dtype is float or int32_t depending on the scorer (for the default scorer used here it is float). So for 1 million names the result matrix would require about 4 terabytes of memory. My PC has 12 GB of free RAM, which is nowhere near enough. Any ideas how to avoid overloading RAM while keeping the computation vectorized?
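For reference, here is a minimal sketch of that back-of-the-envelope calculation, using the formula quoted above; the dtype sizes come from numpy itself and the selection of dtypes to compare is just illustrative:

import numpy as np

# Memory estimate per the formula len(queries) x len(choices) x size(dtype),
# for 1 million names compared against themselves.
n = 1_000_000
for dtype in (np.float32, np.float64, np.uint8):
    required_bytes = n * n * np.dtype(dtype).itemsize
    print(f"{np.dtype(dtype).name}: {required_bytes / 1e12:.0f} TB")
# float32: 4 TB, float64: 8 TB, uint8: 1 TB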

Following @J.M.Arnold's solution, including his comment, the code may be rewritten as:

import numpy as np
import pandas as pd
from rapidfuzz import process

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)
names = df_test["name"]
def calculate_similarity_matrix(names):
    scores = pd.DataFrame(process.cdist(names, names, workers=-1),  columns=names, index=names)
    return scores
chunks = np.array_split(names, 1000)
matrices = []
for i, chunk in enumerate(chunks):
    matrix = calculate_similarity_matrix(chunk)
    matrices.append(matrix)
finished = pd.concat(matrices)
x, y = np.where(finished > 50)
groups = (pd.DataFrame(finished.index[x], finished.index[y])
           .groupby(level=0)
           .agg(frozenset)
           .drop_duplicates()
           .reset_index(drop=True)
           .reset_index()
           .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id += 1
df_test = df_test.merge(groups, how="left")

But it will not generate correct results:

          name  cluster_number             id
0        Beach               1              2
1  South Beach               1              8
2      Big Dog               2              3
3          Cat               2              5
4          Dog               2              7
5          Ant               3              1
6         Bird               3              4
7         Dear               4              6

Note that e.g. Dog and Big Dog end up with different ids even though they should share the same one. This happens because each chunk is only compared against itself, so pairs of names that land in different chunks are never scored, and concatenating the per-chunk matrices does not restore those missing comparisons.
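One way to keep the cross-chunk comparisons (just a sketch, not verified at the 1-million-row scale; chunk_size, the cutoff of 50 and the pairs/left/right names are illustrative) would be to score each chunk against the full name list instead of against itself:

import numpy as np
import pandas as pd
from rapidfuzz import process

# Score each chunk of queries against ALL names, so pairs that fall into
# different chunks are still compared. Each block is len(chunk) x len(names)
# values (for 1 million names and chunk_size=1000 that is ~4 GB as float32),
# much smaller than the full square matrix.
chunk_size = 1000
chunks = np.array_split(names, max(1, len(names) // chunk_size))

pairs = []
for chunk in chunks:
    block = process.cdist(chunk, names, workers=-1)
    x, y = np.where(block > 50)
    pairs.append(pd.DataFrame({"left": chunk.iloc[x].to_numpy(),
                               "right": names.iloc[y].to_numpy()}))
pairs = pd.concat(pairs, ignore_index=True)

The unique ids would then have to be built from these pairs rather than from one concatenated square matrix.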


1 Answer

Answered by J. M. Arnold:

As maxbachmann said in your GitHub issue, it's all about the default dtype:

By default this dtype is float or int32_t depending on the scorer (for the default scorer you are using it is float)

If you take a look at the documentation of rapidfuzz.process.cdist you can see that the dtype is specified as follows:

similarity:
- np.float32, np.float64
- np.uint8 -> stores a fixed point representation of the result scaled to a range 0-100

distance:
- np.int8, np.int16, np.int32, np.int64

If not given, then the type will be np.float32 for similarities and np.int32 for distances.

You can calculate the size of the matrix via len(queries) x len(choices) x size(dtype), which for your current implementation is 1 million x 1 million x 4 bytes (np.float32 - the default for the scorer you are using). That is about 4 TB (roughly 3.6 TiB)! The same holds for int32 with 4 bytes, as Max Bachmann mentioned.

One option to dodge your issue is to decrease the size of the dtype - e.g. using np.uint8 with 1 byte, which stores the scores as fixed point integers scaled to the range 0-100. Obviously, you will have less precise similarity scores, since the fractional part is lost! With the above mentioned formula you would be able to decrease the size down to ~1 TB.
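A minimal sketch of what that could look like - dtype is a documented parameter of process.cdist, and the value here is just for illustration:

import numpy as np
from rapidfuzz import process

# Request a 1-byte result matrix instead of the default np.float32;
# scores are then stored as fixed point integers in the range 0-100.
scores = process.cdist(names, names, dtype=np.uint8, workers=-1)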

Another approach (and probably the only viable one in the long run) is to split up the data and process it in smaller chunks - as Max Bachmann suggested.

  1. Define a function that handles the calculation of the similarity scores for a matrix. (Similar to your code)
  2. Split the list of names into smaller pieces.
  3. Iterate over the chunks and store the similarity matrix for each step.
  4. Concatenate the results into one big matrix.
import numpy as np
import pandas as pd
from rapidfuzz import process

# Step 1
def calculate_similarity_matrix(names):
    # Do any preprocessing here; in the end, return the similarity matrix for "names"
    scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
    return scores

# Step 2
# Split the names list into 1000 chunks - for 1 million names that is roughly 1000 names per chunk
chunks = np.array_split(names, 1000)

# Step 3
# Iterate over the chunks and store each similarity matrix on disk
for i, chunk in enumerate(chunks):
    matrix = calculate_similarity_matrix(chunk)
    matrix.to_pickle(f"matrix_{i}.pkl")

# Step 4
# Read the matrices
matrices = [pd.read_pickle(f"matrix_{i}.pkl") for i in range(len(chunks))]
# Concatenate
finished = pd.concat(matrices)

Afterwards, you will have the full calculated similarity matrix in finished!

This approach will allow you to process larger data sets without running out of memory / getting a memory overload (as your question asked)! This is because the matrices are stored on your disk between iterations.

However, my approach will definitely be slower (compared to processing all the data at once - which is not possible unless you have around 4 TB of RAM), as you will need to read from and write to your disk 1,000 times.

Obviously, you can play around with the number of chunks you're using. In my current approach you have 1,000 chunks with 1,000 names each. With np.float32 at 4 bytes, each step will only require about 4 MB of RAM as per our above formula. You can play around and find the optimum for your hardware!
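If you want to estimate the per-chunk memory up front, the same formula applies to each block; the chunk sizes below are arbitrary examples:

import numpy as np

# Per-chunk memory for a chunk_size x chunk_size block of np.float32 scores.
for chunk_size in (1_000, 10_000, 50_000):
    block_bytes = chunk_size * chunk_size * np.dtype(np.float32).itemsize
    print(f"chunk_size={chunk_size:,}: {block_bytes / 1e6:,.0f} MB per block")
# 1,000 -> 4 MB; 10,000 -> 400 MB; 50,000 -> 10,000 MB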