https://github.com/axiomhq/hyperminhash/tree/master

#1 mynameisfiber:
This is great! Probabilistic datastructures are amazing.
I did straight-up HLL++ a while ago but had to do intersection with the inclusion–exclusion principle. It would be cool to try to fold this in somehow! (https://github.com/mynameisfiber/gohll)
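For anyone unfamiliar with the inclusion–exclusion route: the intersection size is derived from three cardinality estimates (each set and their union). A minimal Python sketch of that arithmetic follows. The function name is hypothetical, the inputs are plain numbers standing in for sketch estimates, and none of this is gohll's actual API.

```python
def intersection_estimate(card_a: float, card_b: float, card_union: float) -> float:
    """Inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|.

    The three inputs are assumed to be cardinality estimates, e.g. from
    two HLL sketches and their merged (union) sketch. Estimation noise can
    push the difference below zero, so the result is clamped at 0.
    """
    return max(0.0, card_a + card_b - card_union)


# Hypothetical estimates: |A| ≈ 1000, |B| ≈ 800, |A ∪ B| ≈ 1500
overlap = intersection_estimate(1000.0, 800.0, 1500.0)
print(overlap)
```

The catch is that the errors of all three estimates land on the result, so when the true intersection is small relative to the sets, the answer can be mostly noise.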

#2 stochastic_monk:
Great work!
I haven't messed around with the count-min augmentation for HLLs, but I've had a lot of success with HLLs in practice.
I've experimented with using HLL sketches for set operations for genome comparisons. The great thing about these operations is that their cost scales with the size of the sketch, not the size of the dataset being sketched. Add SIMD acceleration, and you have an extremely fast, compact, accurate way to perform approximate set operations.
Unfortunately, the naive intersection operation (a sketch consisting of an elementwise 'min' of the counts in both sketches) does not perform well in cardinality estimation.
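To make that concrete, here is a toy Python sketch, assuming a textbook HyperLogLog with 1024 registers, SHA-1 as the hash, and the classic raw estimator with a linear-counting fallback. All names are hypothetical; this is not the code from any of the libraries mentioned in this thread. It shows that the elementwise max gives exactly the sketch of the union, while the elementwise min lands far from the true intersection size.

```python
import hashlib
import math

P = 10          # number of index bits (assumed for this toy example)
M = 1 << P      # 1024 registers


def sketch(items):
    """Build HLL registers: each holds the max rank seen for its bucket."""
    regs = [0] * M
    for item in items:
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = h >> (64 - P)                         # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)            # remaining 54 bits
        rank = (64 - P) - rest.bit_length() + 1     # leading zeros + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def estimate(regs):
    """Classic raw HLL estimate, with linear counting for small cardinalities."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


def union(a, b):
    """Elementwise max: identical to sketching the union directly."""
    return [max(x, y) for x, y in zip(a, b)]


def naive_intersection(a, b):
    """Elementwise min: NOT a valid sketch of the intersection."""
    return [min(x, y) for x, y in zip(a, b)]


set_a = range(0, 20000)
set_b = range(15000, 35000)        # true overlap: 5000, true union: 35000
a, b = sketch(set_a), sketch(set_b)

u = union(a, b)
print("union estimate:", round(estimate(u)))
print("naive intersection estimate:", round(estimate(naive_intersection(a, b))))
print("true intersection:", 5000)
```

The cost of `union` and `naive_intersection` depends only on the 1024 registers, not on how many items were sketched, which is the scaling property described above.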
The recent Ertl paper [0] and associated code [1] put a lot of effort into more accurate estimation and set operations. In my experiments, I've found that his modified estimation formula is always more accurate than the standard method and remains accurate to much higher cardinalities for a given sketch size.
My SIMD-accelerated, thread-safe HyperLogLog implementation in C++ (with Python bindings) using this improved estimation method is available at [2]. Using this structure, I was able to perform over a billion full-genome Jaccard index calculations overnight using `bonsai dist` from [3].
[0]: https://arxiv.org/pdf/1702.01284.pdf
[1]: https://github.com/oertl/hyperloglogsketchestimationpaper

#3 yunwilliamyu:
Thanks so much for producing a Golang implementation of our HyperMinHash paper (I'm the first author) [0] and pointing me to your Hacker News post!
As an aside, we also have a prototype Python implementation [1], but your implementation looks more efficient than ours, and I may start pointing people to your GitHub repo too.

#4 geocar:
What is this[1] supposed to do?
[1]: https://github.com/axiomhq/hyperminhash/blob/master/hypermin...

#5 djhworld:
Forgive me if this is a dumb question, but is this library a way of merging multiple HyperLogLogs together and still have a reasonable cardinality estimate?
What sort of loss do you see as you merge more HyperLogLogs? As in, if you had Set1, Set2, Set3, ..., SetN.

#6 kgdinesh:
Useful library. I hope someone out there is writing one for Java as well.

#7 cliftoncburton:
GREAT LIBRARY.