Is the number of rows always 1 in each band in the Spark implementation of MinHashLSH?


I'm trying to understand the MinHash LSH implementation in Spark, org.apache.spark.ml.feature.MinHashLSH.

These two files seem the most relevant: MinHashLSH.scala and LSH.scala.

To use MinHashLSH, the doc says it needs a numHashTables parameter, which is the

Param for the number of hash tables used in LSH OR-amplification.

My understanding is that the hashFunction method calculates the MinHash signature for each input instance (e.g. a collection of shingles/tokens that represent a document). For example:

from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

# each instance is a sparse binary vector over a vocabulary of size 6;
# the non-zero indices are the tokens/shingles present in that "document"
data = [
    (0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
    (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0])),
]

df = spark.createDataFrame(data, ["id", "features"])

mh = MinHashLSH(inputCol='features', outputCol='hashes', numHashTables=5)
mh_model = mh.fit(df)
mh_model.transform(df).show(truncate=False)

outputs:

+---+-------------------------+---------------------------------------------------------------------------------+
|id |features                 |hashes                                                                           |
+---+-------------------------+---------------------------------------------------------------------------------+
|0  |(6,[0,1,2],[1.0,1.0,1.0])|[[3.84540858E8], [5.873031E8], [6.9646868E8], [3.95377821E8], [1.050129459E9]]   |
|1  |(6,[2,3,4],[1.0,1.0,1.0])|[[3.19950681E8], [1.87323383E8], [1.237542508E9], [6.11223988E8], [2.07958525E8]]|
+---+-------------------------+---------------------------------------------------------------------------------+

So numHashTables is the number of hash functions used, i.e. the length of the MinHash signature per input instance.
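
For intuition, here is a rough Python sketch of what hashFunction appears to compute for one instance, if I read MinHashLSH.scala correctly (2038074743 is HASH_PRIME in the Scala source; the coefficient sampling below is a simplification of what fit generates):

import random

HASH_PRIME = 2038074743  # MinHashLSH.HASH_PRIME in the Scala source

# fit() draws one random (a, b) pair per hash function; simplified here
random.seed(0)
coeffs = [(random.randint(1, HASH_PRIME - 1), random.randint(0, HASH_PRIME - 1))
          for _ in range(5)]  # numHashTables = 5

def minhash_signature(active_indices):
    # for each hash function, the signature entry is the minimum of
    # ((1 + elem) * a + b) mod HASH_PRIME over the instance's set elements
    return [min(((1 + x) * a + b) % HASH_PRIME for x in active_indices)
            for a, b in coeffs]

print(minhash_signature([0, 1, 2]))  # one value per hash table, as in the output above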

But I haven't seen anything in the code related to the banding technique as introduced in Chapter 3 of the book Mining of Massive Datasets. The banding technique partitions the MinHash signature of each instance into b bands with r rows in each band, so the product of r and b equals the number of hash functions used. Tuning b and r has important implications for the rate of false positives and false negatives when computing candidate pairs.
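
For reference, a minimal sketch of the banding technique itself; band_keys is a hypothetical helper for illustration, not something in Spark:

def band_keys(signature, b, r):
    # split a signature of length b*r into b bands of r rows each; two
    # instances become a candidate pair if they agree on ALL r rows of at
    # least one band (AND within a band, OR across bands)
    assert len(signature) == b * r
    return [(band, tuple(signature[band * r:(band + 1) * r]))
            for band in range(b)]

sig = [3, 1, 4, 1, 5, 9]
print(band_keys(sig, b=3, r=2))  # [(0, (3, 1)), (1, (4, 1)), (2, (5, 9))]
print(band_keys(sig, b=6, r=1))  # with r=1, every hash value is its own band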

From the way sameBucket in approxNearestNeighbors and the two-dataset join in approxSimilarityJoin are implemented, it looks like the number of rows is always assumed to be one. The number of bands is then equal to the number of hash functions, i.e. numHashTables, which is also consistent with the description that numHashTables is used in LSH OR-amplification.
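
To make that concrete, here is my rough PySpark reconstruction of that candidate-generation step (not the actual code in LSH.scala; vector_to_array requires Spark 3.0+, and I rely on each hash vector having dimension 1):

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# explode the hash array so each (instance, hash table) pair becomes a row,
# then join on (table index, hash value): agreeing on a single hash value
# per table is exactly an r=1 band
exploded = (mh_model.transform(df)
            .select("id", F.posexplode("hashes").alias("table", "hash"))
            .withColumn("hash", vector_to_array("hash")[0]))

candidates = (exploded.join(exploded.withColumnRenamed("id", "id_b"),
                            on=["table", "hash"])
              .where(F.col("id") < F.col("id_b"))
              .select("id", "id_b")
              .distinct())
candidates.show()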

However, if r is always 1, then with, say, numHashTables=10, we could be computing the Jaccard distance (a.k.a. keyDistance in MinHashLSH) for many more pairs (false positives; related code here) than we need to, based on the equation

Pr = 1 - (1 - s^r)^b

where Pr is the probability that a pair with Jaccard similarity (NOT distance) s is hashed to the same bucket in at least one band.
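
Plugging in numbers makes the concern concrete: with 10 hash functions total, compare r=1, b=10 against r=2, b=5:

def candidate_prob(s, r, b):
    # probability that a pair with Jaccard similarity s agrees on all r
    # rows of at least one of the b bands
    return 1 - (1 - s ** r) ** b

for s in (0.1, 0.3, 0.5, 0.8):
    print(f"s={s}: r=1,b=10 -> {candidate_prob(s, 1, 10):.3f}, "
          f"r=2,b=5 -> {candidate_prob(s, 2, 5):.3f}")

# a pair with s=0.1 becomes a candidate ~65% of the time with r=1, b=10,
# but only ~5% of the time with r=2, b=5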

So I'd like to confirm whether my understanding is correct: the number of rows is always assumed to be 1 in the implementation of org.apache.spark.ml.feature.MinHashLSH, and numHashTables is equal to both the number of hash functions used to compute the MinHash signature and the number of bands/hash tables used in locality-sensitive hashing.


There is 1 answer:

zyxue:

I see on https://spark.apache.org/docs/latest/ml-features#feature-transformation that:

The type of outputCol is Seq[Vector] where the dimension of the array equals numHashTables, and the dimensions of the vectors are currently set to 1. In future releases, we will implement AND-amplification so that users can specify the dimensions of these vectors.

So it looks like the number of rows (used for AND-amplification) in each band is indeed set to 1.
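
A quick check from PySpark (reusing the mh_model fitted above) is consistent with this:

# every inner vector in `hashes` has dimension 1, i.e. one row per band
row = mh_model.transform(df).first()
print([len(v) for v in row["hashes"]])  # [1, 1, 1, 1, 1]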