I'm trying to understand the MinHash LSH implementation in Spark, org.apache.spark.ml.feature.MinHashLSH.
These two files seem the most relevant: MinHashLSH.scala and LSH.scala.
To use MinHashLSH, the docs say it needs a numHashTables parameter, which is the "Param for the number of hash tables used in LSH OR-amplification."
My understanding is that the hashFunction method calculates the MinHash signature for each input instance (e.g. a collection of shingles/tokens that represent a document), e.g.
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

data = [
    (0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
    (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0])),
]
df = spark.createDataFrame(data, ["id", "features"])
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
mh_model = mh.fit(df)
mh_model.transform(df).show(truncate=False)
outputs:
+---+-------------------------+---------------------------------------------------------------------------------+
|id |features |hashes |
+---+-------------------------+---------------------------------------------------------------------------------+
|0 |(6,[0,1,2],[1.0,1.0,1.0])|[[3.84540858E8], [5.873031E8], [6.9646868E8], [3.95377821E8], [1.050129459E9]] |
|1 |(6,[2,3,4],[1.0,1.0,1.0])|[[3.19950681E8], [1.87323383E8], [1.237542508E9], [6.11223988E8], [2.07958525E8]]|
+---+-------------------------+---------------------------------------------------------------------------------+
So numHashTables is the number of hash functions used, i.e. the length of the MinHash signature per input instance.
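To make that concrete, here is a toy sketch (not Spark's actual code) of how a MinHash signature can be computed with random affine hash functions. My reading of MinHashLSH.scala is that hashFunction does something along these lines, with one random (a, b) pair per hash table and a fixed large prime, but treat the exact formula and constant below as my assumption:

import random

PRIME = 2038074743  # assumption: a large prime, per my reading of MinHashLSH.scala

def make_hash_fns(num_hash_tables, seed=42):
    # One random pair (a, b) per hash table, defining h(i) = ((1 + i) * a + b) % PRIME.
    rng = random.Random(seed)
    return [(rng.randint(1, PRIME - 1), rng.randint(0, PRIME - 1))
            for _ in range(num_hash_tables)]

def minhash_signature(active_indices, hash_fns):
    # The MinHash value for one hash function is the minimum hash over the
    # indices of the non-zero entries of the (binary) feature vector.
    return [min(((1 + i) * a + b) % PRIME for i in active_indices)
            for (a, b) in hash_fns]

fns = make_hash_fns(num_hash_tables=5)
print(minhash_signature({0, 1, 2}, fns))  # instance 0 above
print(minhash_signature({2, 3, 4}, fns))  # instance 1 above

Note that the signature length equals numHashTables, matching the five single-element arrays in the hashes column above.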
But I haven't seen anything related to the banding technique, as introduced in Chapter 3 of the book Mining of Massive Datasets, in the code yet. The banding technique partitions the MinHash values of each instance into b bands with r rows in each band, so the product of r and b should equal the number of hash functions used. The tuning of b and r has important implications for the trade-off between false positives and false negatives when selecting candidate pairs.
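For reference, here is a toy sketch of the banding technique from the book (not anything in the Spark source; the helper names are mine): the signature is split into b bands of r rows each, each band is hashed to a bucket as a unit, and two instances become a candidate pair if they collide in at least one band.

def band_buckets(signature, b, r):
    # Split the signature into b bands of r rows and hash each band as a unit.
    assert len(signature) == b * r
    return [hash(tuple(signature[band * r:(band + 1) * r])) for band in range(b)]

def is_candidate_pair(sig1, sig2, b, r):
    # Candidate pair iff the two signatures agree on ALL r rows of SOME band
    # (AND-amplification within a band, OR-amplification across bands).
    return any(h1 == h2 for h1, h2 in zip(band_buckets(sig1, b, r),
                                          band_buckets(sig2, b, r)))

# With r = 1 every band is a single MinHash value, so one matching value
# anywhere in the signature already makes the pair a candidate.
print(is_candidate_pair([7, 3, 9, 4], [7, 8, 1, 5], b=4, r=1))  # True
print(is_candidate_pair([7, 3, 9, 4], [7, 8, 1, 5], b=2, r=2))  # False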
From the way sameBucket in approxNearestNeighbors and the two-dataset join in approxSimilarityJoin are implemented, it looks like the number of rows is always assumed to be one, so the number of bands equals the number of hash functions, i.e. numHashTables, which is also consistent with the description that numHashTables is used in LSH OR-amplification.
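If that reading is right, the candidate test degenerates to "the two signatures share at least one MinHash value". For reference, this is how the join is invoked on the toy DataFrame above (the threshold value is an arbitrary choice of mine):

from pyspark.sql import functions as F

# Continuing from the example above. If the reading above is right, a pair
# becomes a candidate as soon as ANY of the 5 hash values matches, and the
# exact Jaccard distance is only computed afterwards to filter candidates
# against the threshold.
pairs = mh_model.approxSimilarityJoin(df, df, threshold=0.9,
                                      distCol="JaccardDistance")
pairs.select(F.col("datasetA.id").alias("idA"),
             F.col("datasetB.id").alias("idB"),
             "JaccardDistance").show()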
However, if r is always one, then assuming numHashTables=10, we could be computing the Jaccard distance (a.k.a. keyDistance in MinHashLSH) for many more pairs (false positives; related code here) than we need to, based on the equation

Pr = 1 - (1 - s^r)^b

where Pr is the probability that a pair with Jaccard similarity (NOT distance) s is hashed to the same bucket.
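A quick numeric check of that equation (the numbers are mine, just to illustrate the effect): with 10 hash functions in total, a pair at Jaccard similarity s = 0.3 is almost always a candidate when r = 1 and b = 10, but much more rarely when r = 2 and b = 5:

def candidate_prob(s, r, b):
    # Pr = 1 - (1 - s^r)^b: probability that a pair with Jaccard
    # similarity s agrees on all r rows of at least one of b bands.
    return 1 - (1 - s ** r) ** b

s = 0.3  # a fairly dissimilar pair
print(candidate_prob(s, r=1, b=10))  # ~0.972 -> almost always a candidate
print(candidate_prob(s, r=2, b=5))   # ~0.376 -> far fewer false positives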
So I'd like to confirm whether my understanding is correct: the number of rows is always assumed to be 1 in the implementation of org.apache.spark.ml.feature.MinHashLSH, and numHashTables is equal to both the number of hash functions used for calculating the MinHash signature and the number of bands/hash tables used in the locality-sensitive hashing.
Based on the description at https://spark.apache.org/docs/latest/ml-features#feature-transformation, it indeed looks like the number of rows (used for AND-amplification) in each band is set to 1.