Efficient MinHashing
MinHashing is an efficient way of finding similar records in a dataset based on Jaccard similarity. It estimates the similarity between two records from compact hash signatures instead of comparing their full token sets. PyMinHash implements efficient minhashing for pandas DataFrames. See the instructions below or look at the example notebook to get started.
Developed by Frits Hermans
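
To make the idea concrete, here is a minimal pure-Python sketch of minhashing. It is an illustration of the general technique only, not PyMinHash's implementation or API: the function names (shingles, jaccard, minhash_signature, estimate_jaccard), the use of character 3-grams as tokens, and the choice of 128 seeded BLAKE2b hashes are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of lowercase character k-grams of a string."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """64-bit hash of a token; the seed selects one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, n_hashes=128):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(n_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical example records: two spellings of the same name.
a = shingles("Frits Hermans")
b = shingles("Fritz Hermans")
print("exact Jaccard:   ", jaccard(a, b))
print("MinHash estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate works because, for a single hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity. Using more hash functions reduces the variance of the estimate at the cost of longer signatures.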