A while ago, I was looking at cardinality estimators for use in a distributed setting – given a data set spread over a set of nodes, we want to compute the total number of unique keys without having to transfer all keys or a global bit signature. Counting sketches such as HyperLogLog (see here, here and here for an introduction) have superior memory usage and CPU performance when it is enough to estimate the cardinality within a small error margin. In the following, I summarize a comparison of two Java libraries, StreamLib and Java-HLL, that I did back in February 2014.
StreamLib implements several methods:
- Linear counting (`lincnt`) - hashes values into positions in a bit vector and then estimates the number of items based on the number of unset bits.
- LogLog (`ll`) - uses hashing to add an element to one of the m different estimators and updates the maximum observed rank: `updateRegister(h >>> (Integer.SIZE - k), Integer.numberOfLeadingZeros((h << k) | (1 << (k - 1))) + 1)`, where `k = log2(m)`. The cardinality is estimated as `Math.pow(2, Ravg) * a`, where Ravg is the average maximum observed rank across the m registers and a is a correction function for the given m (see the paper for details).
- HyperLogLog (`hll`) - improves the LogLog algorithm in several respects, for example by using the harmonic mean.
- HyperLogLog++ (`hlp`) - Google’s take on HLL that improves memory usage and accuracy for small cardinalities.

Java-HLL (`hlx`), on the other hand, provides a set of tweaks to HyperLogLog, mainly exploring the idea that a chunk of data, say 1280 bytes, can be used to fully represent a short sorted list, a sparse/lazy map of non-empty registers, or a full register set (see the project page for details).
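To make the LogLog register update and estimate described above more concrete, here is a minimal sketch in Java. It is not StreamLib's code: the class name `LogLogSketch`, the `mix` function (a stand-in 32-bit mixer borrowed from the MurmurHash3 finalizer) and the use of the asymptotic LogLog constant 0.39701 times m as the correction `a` are all assumptions made for illustration.

```java
/** Minimal LogLog sketch over 32-bit hashes; illustrative only, not StreamLib's code. */
final class LogLogSketch {
    // Asymptotic LogLog constant (assumption: m is large enough for it to apply).
    private static final double ALPHA_INF = 0.39701;

    private final int k;            // k = log2(m): number of hash bits used as register index
    private final byte[] registers; // maximum observed rank per register

    LogLogSketch(int k) {
        if (k < 1 || k > 16) throw new IllegalArgumentException("k must be in [1, 16]");
        this.k = k;
        this.registers = new byte[1 << k];
    }

    /** Stand-in 32-bit mixer (MurmurHash3 finalizer); any well-mixing hash would do. */
    static int mix(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Top k bits pick the register; the rank comes from the remaining bits. */
    void offerHash(int h) {
        int index = h >>> (Integer.SIZE - k);
        // The OR-ed guard bit caps the rank when all remaining bits are zero.
        int rank = Integer.numberOfLeadingZeros((h << k) | (1 << (k - 1))) + 1;
        if (rank > registers[index]) registers[index] = (byte) rank;
    }

    /** Estimate = 2^Ravg * a, with a approximated as ALPHA_INF * m. */
    double estimate() {
        double sum = 0;
        for (byte r : registers) sum += r;
        double rAvg = sum / registers.length;
        double a = ALPHA_INF * registers.length;
        return Math.pow(2, rAvg) * a;
    }

    public static void main(String[] args) {
        LogLogSketch sketch = new LogLogSketch(10); // m = 1024 registers
        for (int i = 1; i <= 100_000; i++) sketch.offerHash(mix(i));
        System.out.printf("estimate: %.0f (true: 100000)%n", sketch.estimate());
    }
}
```

Offering the same hash twice never changes a register, which is why duplicates do not affect the estimate.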
I used two relatively small real-world data sets, similar to what was intended to be used in production. For hashing I used StreamLib’s `MurmurHash.hash64`, which for some reason performed better than Guava’s on the test data (I haven’t investigated the reason though). The latency times given below are cold-start numbers, measured without accounting for JIT warm-up and other effects. In other words, these are not scientific results.
The first data set has the following characteristics:
- 3765844 tokens
- 587913 unique keys (inserting into a
- 587913 unique hashed keys (
First, let's compare the StreamLib methods tuned for 1% error with 10 million keys. The collected data includes the name of the method, the relative error, the total estimator size and the total elapsed time. The number behind
hlp denotes the
Here HLP performs best, with only 0.81% error while using only 5 KB of memory.
Now, let's compare StreamLib and Java-HLL. The parameter behind the StreamLib method names is `log2(m)`, while the parameters behind `hlx` are `log2(m)`, register width (5 seems to be the only value that works), promotion threshold (-1 denotes the auto mode) and the initial representation type.
Here Java-HLL is both more accurate and faster.
The second data set has the following characteristics:
- 3765844 tokens
- 2074012 unique keys (
- 2074012 unique hashed keys (
StreamLib methods tuned for 1% error with 10 mil keys:
And StreamLib vs Java-HLL:
So the results are similar to those for the first data set.
This comparison was done more than two years ago, and I was quite skeptical of both frameworks. I found many strange things in StreamLib (both the reported issues and more), while Java-HLL, for its part, did not work with register widths other than 5. I settled for Java-HLL since it had a better implementation and gave better results. However, things change fast and StreamLib might have improved a lot since then. I still want to look more at the code in both frameworks, and perhaps at the frameworks that have been published since then.
Nevertheless, HLL is clearly a method worth using. A really nice feature of HLL is that you can keep multiple counters and add (union) them together without loss. Intersection, however, can be tricky.
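The lossless union can be illustrated with a small Java sketch: if every node keeps its own register array (same size, same hash function), the union is just the element-wise maximum of the registers. The names `RegisterUnion`, `offer`, `union` and the `mix` stand-in hash are hypothetical and not taken from StreamLib or Java-HLL.

```java
import java.util.Arrays;

/** Illustrative only: union of LogLog/HLL-style register arrays via element-wise max. */
final class RegisterUnion {
    /** Stand-in 32-bit mixer (MurmurHash3 finalizer); an assumption, not a library call. */
    static int mix(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Updates the register selected by the top k bits of h with the observed rank. */
    static void offer(byte[] registers, int k, int h) {
        int index = h >>> (Integer.SIZE - k);
        int rank = Integer.numberOfLeadingZeros((h << k) | (1 << (k - 1))) + 1;
        if (rank > registers[index]) registers[index] = (byte) rank;
    }

    /** Union of two counters: keep the larger rank in every register. */
    static byte[] union(byte[] a, byte[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("register counts differ");
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) Math.max(a[i], b[i]);
        return out;
    }

    public static void main(String[] args) {
        int k = 10;
        byte[] nodeA = new byte[1 << k], nodeB = new byte[1 << k], all = new byte[1 << k];
        // Two nodes with overlapping key ranges, plus one counter that sees every key.
        for (int i = 1; i <= 60_000; i++) { offer(nodeA, k, mix(i)); offer(all, k, mix(i)); }
        for (int i = 40_001; i <= 100_000; i++) { offer(nodeB, k, mix(i)); offer(all, k, mix(i)); }
        System.out.println(Arrays.equals(union(nodeA, nodeB), all)); // prints "true"
    }
}
```

Because the maximum is commutative, associative and idempotent, the merged registers are identical to those of a single counter that saw all keys, so only the register arrays need to travel between nodes.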
The register width in LogLog methods is the number of bits needed to represent the maximum position of the first 1 bit. There are `m = (beta / se)^2` such registers, where beta is a method-related constant and se is the desired standard error, say 0.01. I guess this comes from `StdErr = StdDev / sqrt(N)` for a sample mean of a population (ref. Wikipedia), but my knowledge of statistics is a bit too rusty to really understand this. Consequently, my understanding of the papers is that LogLog has `beta = 1.30`, HLL has `beta = 1.106` and HLL++ has `beta = 1.04`, but I might be wrong. After all, the StreamLib code used these three numbers seemingly at random across methods and tests. When I asked which one was correct, they asked me back. Honestly, I don’t know :)
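As a small worked example of the register-count formula, the following Java sketch computes `m = (beta / se)^2` and the smallest `k = log2(m)` that provides at least that many registers. Rounding up to a power of two is my assumption (the register index is taken from k hash bits), and the names `RegisterCount`, `rawRegisterCount` and `log2m` are made up for illustration.

```java
/** Illustrative only: how many registers a target standard error implies. */
final class RegisterCount {
    /** m = (beta / se)^2, before rounding to a power of two. */
    static double rawRegisterCount(double beta, double se) {
        double ratio = beta / se;
        return ratio * ratio;
    }

    /** Smallest k such that 2^k >= (beta / se)^2. */
    static int log2m(double beta, double se) {
        return (int) Math.ceil(Math.log(rawRegisterCount(beta, se)) / Math.log(2));
    }

    public static void main(String[] args) {
        double se = 0.01; // desired standard error of 1%
        for (double beta : new double[] {1.30, 1.106, 1.04}) {
            System.out.printf("beta=%.3f -> m~%.0f, log2(m)=%d%n",
                    beta, rawRegisterCount(beta, se), log2m(beta, se));
        }
    }
}
```

With se = 0.01, the three constants quoted above give roughly 16900, 12232 and 10816 registers, which round up to 2^15, 2^14 and 2^14 respectively, so the choice of beta only changes the register count here when it crosses a power-of-two boundary.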