I have a table containing user names (~1 000 rows) called "potential_users" and another one called "actual_users" (~ 10 million rows). All records are exclusively made up of [a-z] characters, no white space. Additionally, I know that none of the potential_users are in the actual_users tables.
I would like to be able to calculate for each row in potential_users what is the closest record in actual_users based on the Levenshtein distance. For example:
| potential_users|
|----------------|
| user1 |
| kajd |
| bbbbb |
and
| actual_users |
|--------------|
| kaj |
| bbbbbbb |
| user |
Would return:
| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1 | user | 1 |
| kajd | kaj | 1 |
| bbbbb | bbbbbbb | 2 |
If the tables were short, I could make a cross join that would calculate for each record in potential_users the Levenshtein distance in actual_users and then return the one with the lowest value. However, in my case this would create an intermediary table of 1 000 x 10 000 000 rows, which is a little impractical.
Is there a cleaner way to perform such operation with creating a cross join?
Unfortunately, there's no way to do it without a cross join. At the end of the day, every potential user needs to be tested against every actual user.
However, Trino (formerly known as Presto SQL)will execute the join in parallel across many threads and machines, so it can execute very quickly given enough hardware. Note that in Trino, the intermediate results are streamed from operator to operator, so there's no "intermediate table" with 10M x 1k rows for this query.
For a query like
This is the query plan:
In particular, for this section, as soon as a row is produced by the cross join, it is fed into the projection operator that calculates the Levenshtein distance between the two values and then into the aggregation, which only stores one group per "potential" user. Therefore, the amount of memory required by this query should be low.