[spark-hash学习]minhash算法实现细节

xiaoxiao2021-03-25 71

Implementation Details

Implementation of LSH follows the rough steps

minhash each vector some number of times. The number of times to hash is an input parameter. The hashing function is defined in com.invincea.spark.hash.Hasher. Essentially each element of the input vector is hashed and the minimum hash value for the vector is returned. Minhashing produces a set of signatures for each vector.Chop up each vector's minhash signatures into bands where each band contains an equal number of signatures. Bands with a greater number of signatures will produce clusters withgreater similarity. A greater number of bands will increase the probabilty that similar vector signatures hash to the same value.Order each of the vector bands such that for each band the vector's data for that band are grouped together.Hash each band and group similar values. These similar values for a given band that hash to the same value are added to the result set.Optionally filter results. An example operation would be to filter out singleton sets.

实现细节

LSH的实现遵循以下粗略的步骤：

1. 对每个向量(vector)进行若干次minhash。hash的次数是一个待输入的参数(numRows)。哈希函数已经在com.invincea.spark.hash.Hasher中进行了定义。基本上输入向量的每一个元素都被hash若干次，并且对于一个向量的最小的hash值会被返回。Minhashing对每个向量会生成一系列的指纹签名。

2. 将每个向量的minhash指纹签名分成若干个band，其中每个band包含有相同数量的指纹签名。具有更多数量指纹签名的band将会产生具有更高相似度的cluster。更多数量的band将会增加相似向量指纹签名也被hash成相同值的可能性。

3. Order each of the vector bands such that for each band the vector's data for that band are grouped together.

4. hash每个band并将相似值组成组。这些对于一个给定band来说相似的值（即hash得到相同值）被添加到结果集中。

5.你也可以选择过滤以下结果集。一个示例操作将会过滤掉单集。

原文链接：

https://github.com/mrsqueeze/spark-hash#implementation-details

转载请注明原文地址: https://ju.6miu.com/read-33362.html

技术

最新回复(0)