A while ago, I was looking at cardinality estimators for use in a distributed setting: given a data set spread over a set of nodes, we want to compute the total number of unique keys without transferring all keys or a global bit signature. Counting sketches such as HyperLogLog (see here, here and here for an introduction) offer superior memory usage and CPU performance when cardinality can be estimated with a small error margin. In the following, I summarize a comparison between two Java libraries, StreamLib and Java-HLL, that I did back in February 2014.

Methods

StreamLib implements several methods:

  • Linear counting (lincnt) - hashes values into positions in a bit vector and then estimates the number of items based on the number of unset bits.

  • LogLog (ll) - uses hashing to add an element to one of m different estimators, updating the maximum observed rank: updateRegister(h >>> (Integer.SIZE - k), Integer.numberOfLeadingZeros((h << k) | (1 << (k - 1))) + 1), where k = log2(m). The cardinality is estimated as Math.pow(2, Ravg) * a, where Ravg is the average maximum observed rank across the m registers and a is a correction factor for the given m (see the paper for details).

  • HyperLogLog (hll) - improves the LogLog algorithm in several ways, most notably by using the harmonic mean instead of the arithmetic mean.

  • HyperLogLog++ (hlp) - Google’s take on HLL that improves memory usage and accuracy for small cardinalities.
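The register update quoted above can be illustrated with a minimal, self-contained sketch in plain Java (my own simplification for illustration, not StreamLib’s actual implementation; bias correction and small-range handling are omitted):

```java
// A simplified sketch of the LogLog-style register update described above.
public class LogLogSketch {
    private final int k;            // k = log2(m)
    private final byte[] registers; // m = 2^k registers

    public LogLogSketch(int k) {
        this.k = k;
        this.registers = new byte[1 << k];
    }

    // The top k bits of the hash pick a register; the rank (position of the
    // first 1-bit) of the remaining bits updates that register's maximum.
    // The OR-ed guard bit bounds the rank when the remaining bits are all zero.
    public void offerHashed(int h) {
        int index = h >>> (Integer.SIZE - k);
        int rank = Integer.numberOfLeadingZeros((h << k) | (1 << (k - 1))) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    // Estimate as a * 2^Ravg, following the text above; the correction
    // factor a is assumed here to also absorb the factor m.
    public double estimate(double a) {
        double sum = 0;
        for (byte r : registers) {
            sum += r;
        }
        return a * Math.pow(2, sum / registers.length);
    }
}
```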

Java-HLL (hlx), on the other hand, provides a set of tweaks to HyperLogLog, mainly exploring the idea that a chunk of data, say 1280 bytes, can be used to fully represent a short sorted list, a sparse/lazy map of non-empty registers, or a full register set (see the project page for details).

Performance comparison

I used two relatively small real-world data sets, similar to what was intended to be used in production. For hashing I used StreamLib’s MurmurHash.hash64, which for some reason did better than Guava’s on the test data (I haven’t investigated why). The latency numbers given below are cold-start measurements, taken without accounting for JIT compilation and other effects. In other words, these are not scientific results.

Dataset A

The first data set has the following characteristics:

  • 3765844 tokens
  • 587913 unique keys (inserting into a Sets.newHashSet(): 977ms)
  • 587913 unique hashed keys (Sets.newHashSet(): 2520ms)

First, let’s compare the StreamLib methods tuned for 1% error with an expected maximum of 10 million keys. The collected data includes the name of the method, the relative error, the total estimator size and the total elapsed time. The number after ll, hll and hlp denotes the log2(m) parameter:

name error size time
lincnt 0.0017 137073B 1217ms
ll__14 0.0135 16384B 963ms
hll_13 0.0181 5472B 1000ms
hlp13 -0.0081 5473B 863ms

Here HLP performs best, with only 0.81% error while using about 5KB of memory.

Now, let’s compare StreamLib and Java-HLL. The parameter after hlp is log2(m), while the parameters after hlx are log2(m), the register width (5 seems to be the only one that works), the promotion threshold (-1 denotes the auto mode) and the initial representation type.

name error size time
hlp10 0.0323 693B 818ms
hlp11 0.0153 1377B 967ms
hlp12 0.0132 2741B 790ms
hlp13 -0.0081 5473B 731ms
hlp14 -0.0081 10933B 697ms
hlx_105-1_FULL -0.0212 643B 723ms
hlx_105-1_SPARSE -0.0212 643B 680ms
hlx_115-1_FULL -0.0202 1283B 670ms
hlx_115-1_SPARSE -0.0202 1283B 710ms
hlx_125-1_FULL -0.0069 2563B 673ms
hlx_125-1_SPARSE -0.0069 2563B 699ms
hlx_135-1_FULL 0.0046 5123B 702ms
hlx_135-1_SPARSE 0.0046 5123B 672ms
hlx_145-1_FULL 0.0013 10243B 693ms
hlx_145-1_SPARSE 0.0013 10243B 678ms

Here Java-HLL is both more accurate and faster.

Dataset B

The second data set has the following characteristics:

  • 3765844 tokens
  • 2074012 unique keys (Sets.newHashSet(): 1195ms)
  • 2074012 unique hashed keys (Sets.newHashSet(): 2885ms)

StreamLib methods tuned for 1% error with an expected maximum of 10 million keys:

name error size time
lincnt 0.0005 137073B 663ms
ll__14 -0.0080 16384B 578ms
hll_13 0.0131 5472B 515ms
hlp13 -0.0118 5473B 566ms

And StreamLib vs Java-HLL:

name error size time
hlp10 0.0483 693B 560ms
hlp11 0.0336 1377B 489ms
hlp12 -0.0059 2741B 560ms
hlp13 -0.0118 5473B 567ms
hlp14 -0.0025 10933B 495ms
hlx_105-1_FULL -0.0227 643B 575ms
hlx_105-1_SPARSE -0.0227 643B 570ms
hlx_115-1_FULL -0.0194 1283B 505ms
hlx_115-1_SPARSE -0.0194 1283B 573ms
hlx_125-1_FULL -0.0076 2563B 500ms
hlx_125-1_SPARSE -0.0076 2563B 570ms
hlx_135-1_FULL -0.0099 5123B 576ms
hlx_135-1_SPARSE -0.0099 5123B 501ms
hlx_145-1_FULL 0.0015 10243B 572ms
hlx_145-1_SPARSE 0.0015 10243B 500ms

So the results are similar to those with Dataset A.

Conclusions

This comparison was done more than two years ago, and at the time I was quite skeptical of both frameworks. I found many strange things in StreamLib (both the reported issues and more), while Java-HLL did not work with other register widths either. I settled on Java-HLL since it had a better implementation and gave better results. However, things change fast and StreamLib may have improved a lot since then. I still want to look more at the code in both frameworks, and perhaps at the frameworks published since then.

Nevertheless, HLL is clearly the way to go. A really nice feature of HLL is that you can keep multiple counters and add (union) them together without loss. Intersection, however, can be tricky.
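For identically configured counters, that lossless union amounts to a register-wise maximum. Here is a minimal sketch of the idea, assuming bare byte[] register sets of equal size rather than Java-HLL’s actual API:

```java
public class HllUnion {
    // Register-wise maximum: since each register stores the maximum observed
    // rank, merging two counters this way yields exactly the counter that
    // would have resulted from counting the combined stream.
    public static byte[] union(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("counters must use the same m");
        }
        byte[] merged = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = a[i] > b[i] ? a[i] : b[i];
        }
        return merged;
    }
}
```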

Open question

The register width in LogLog methods is the number of bits needed to represent the maximum observed position of the first 1-bit. There are m = (beta / se)^2 such registers, where beta is a method-specific constant and se is the desired standard error, say 0.01. I guess this comes from StdErr = StdDev / sqrt(N) for the sample mean of a population (ref. wikipedia), but my knowledge of statistics is a bit too rusty to really understand this. My reading of the papers is that LogLog has beta = 1.30, HLL has beta = 1.106 and HLL++ has beta = 1.04, but I might be wrong. After all, the StreamLib code used these three numbers seemingly at random across methods and tests. When I asked which was correct, they asked me back. Honestly, I don’t know :)
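For what it is worth, plugging these guessed beta values into m = (beta / se)^2 with se = 0.01, and rounding log2(m) down, reproduces exactly the parameters used in the tables above (a back-of-the-envelope check, not necessarily how StreamLib derives them):

```java
public class RegisterCount {
    // m = (beta / se)^2 registers; return k = log2(m) rounded down,
    // which appears to match the parameters used in the tables above.
    public static int log2m(double beta, double se) {
        double m = Math.pow(beta / se, 2);
        return (int) Math.floor(Math.log(m) / Math.log(2));
    }

    public static void main(String[] args) {
        System.out.println(log2m(1.30, 0.01));  // LogLog:        14 (ll__14)
        System.out.println(log2m(1.106, 0.01)); // HyperLogLog:   13 (hll_13)
        System.out.println(log2m(1.04, 0.01));  // HyperLogLog++: 13 (hlp13)
    }
}
```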



JMH caught my attention several times before, but I never had time to try it out. Although I have written several micro-benchmarks in the past, for example for my master’s thesis, I doubt that any of them were as technically correct as what JMH delivers. For a brief introduction, I would highly recommend looking at this presentation, these blog posts – one, two, three, and the official code examples. More advanced examples can be found in Aleksey Shipilёv’s and Nitsan Wakart’s blog posts. In the following, I write a simple benchmark to test a hypothesis that has bothered me for a while.

First, I generate the benchmark project with Maven and import it into Eclipse.

mvn archetype:generate -DinteractiveMode=false -DarchetypeGroupId=org.openjdk.jmh -DarchetypeArtifactId=jmh-java-benchmark-archetype -DgroupId=com.simonj -DartifactId=first-benchmark -Dversion=1.0

Then, I would like to test the following. Given a sorted (non-decreasing) array of integers, we would like to count the number of distinct values. The simplest solution is a for loop with an equality check on consecutive elements, in other words:

int uniques = numbers.length > 0 ? 1 : 0;
for (int i = 0; i < numbers.length - 1; i++) {
    if (numbers[i] != numbers[i + 1]) {
        uniques += 1;
    }
}

However, we can exploit the fact that the difference between two consecutive, non-equal elements will be negative, so we can just shift the sign bit into the rightmost position and increment the counter by it, in other words:

int uniques = numbers.length > 0 ? 1 : 0;
for (int i = 0; i < numbers.length - 1; i++) {
    uniques += (numbers[i] - numbers[i + 1]) >>> 31;
}

The latter eliminates the inner branch and therefore a potential branch misprediction penalty. Hence, it should be faster, but is it really so? That is exactly what we can test with a JMH benchmark containing two methods, testBranched and testShifted, one for each variant:
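Sketched roughly, such a benchmark might look like the following (my reconstruction based on the method names and parameters in the results below, not the original code; it requires the JMH dependency set up by the archetype above):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class MyFirstBenchmark {

    @Param({"1000", "10000", "100000", "1000000", "10000000", "100000000"})
    int bound;

    @Param({"1000000"})
    int size;

    int[] numbers;

    @Setup
    public void setup() {
        // Sorted array of random numbers, as both variants assume sorted input.
        numbers = new Random().ints(size, 0, bound).sorted().toArray();
    }

    @Benchmark
    public int testBranched() {
        int uniques = numbers.length > 0 ? 1 : 0;
        for (int i = 0; i < numbers.length - 1; i++) {
            if (numbers[i] != numbers[i + 1]) {
                uniques += 1;
            }
        }
        return uniques; // returning the result prevents dead-code elimination
    }

    @Benchmark
    public int testShifted() {
        int uniques = numbers.length > 0 ? 1 : 0;
        for (int i = 0; i < numbers.length - 1; i++) {
            uniques += (numbers[i] - numbers[i + 1]) >>> 31;
        }
        return uniques;
    }
}
```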

Here, I try both variants on an array with 1 000 000 random numbers in the range 0 to bound (sorted, since both variants assume a sorted array). I try the following bounds, 1 000, 10 000, 100 000, 1 000 000, 10 000 000 and 100 000 000, to simulate different cardinalities of the generated data set. On my MacBook, this gives the following results:

Benchmark                        (bound)   (size)  Mode  Cnt     Score     Error  Units
MyFirstBenchmark.testBranched       1000  1000000  avgt   20   764.576 ± 150.049  us/op
MyFirstBenchmark.testBranched      10000  1000000  avgt   20   837.783 ± 143.009  us/op
MyFirstBenchmark.testBranched     100000  1000000  avgt   20  1598.128 ±  17.773  us/op
MyFirstBenchmark.testBranched    1000000  1000000  avgt   20  1209.819 ±  39.535  us/op
MyFirstBenchmark.testBranched   10000000  1000000  avgt   20  1068.606 ±  42.052  us/op
MyFirstBenchmark.testBranched  100000000  1000000  avgt   20   625.952 ±  24.321  us/op
MyFirstBenchmark.testShifted        1000  1000000  avgt   20   973.910 ±  41.843  us/op
MyFirstBenchmark.testShifted       10000  1000000  avgt   20   966.573 ±  30.002  us/op
MyFirstBenchmark.testShifted      100000  1000000  avgt   20   966.102 ±  17.895  us/op
MyFirstBenchmark.testShifted     1000000  1000000  avgt   20   973.528 ±  24.396  us/op
MyFirstBenchmark.testShifted    10000000  1000000  avgt   20   957.287 ±  31.399  us/op
MyFirstBenchmark.testShifted   100000000  1000000  avgt   20  1049.479 ± 108.593  us/op

This shows that for both very large and very small cardinalities the branched code is significantly faster than the shifted one, although somewhere in the middle it is indeed significantly slower. The reason for this is of course branch prediction (read the answers to this question on StackOverflow for details), and it illustrates exactly why we cannot assume that branch elimination by bit-twiddling will always improve performance.

So much for the first try! This was easy and fun, and I will definitely look more into JMH in the future. And by the way, JMH can also do code profiling via the -prof <profiler> option, with -lprof listing the available profilers.


Apache Spark is an open source cluster computing framework, which is becoming extremely popular these days. By now it has taken over the role of many previously used MapReduce and machine learning frameworks. There are plenty of recipes on how to launch a cluster and get the examples and the shell running from there. Nevertheless, assume that for educational purposes or some other odd reason we would like to build a single JAR, with all dependencies included, which then runs some Spark-related code on its own. In that case, here is a simple four-step recipe to get started from scratch.

Create a new Maven Java project

The easiest way to do this is from the command line (look here for an explanation):

mvn archetype:generate -DgroupId=com.simonj.demo -DartifactId=spark-fun -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Edit the POM file

In my example, I first explicitly state the Java version, 1.8. Then, I remove the junit dependency and add dependencies on spark-core_2.10, testng and guava (note the version 16.0 to avoid conflicts with the current version of spark-core). Finally, I use the Maven shade plugin to include the dependencies, with additional filters and transformers needed to get everything working.
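The shade part of the POM might look roughly like this (a sketch rather than the exact file; the filter excludes signed-JAR checksums that would otherwise break the merged JAR, while the transformers merge service files and set the main class):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>allinone</shadedClassifierName>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.simonj.demo.App</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With shadedClassifierName set to allinone, the build produces the spark-fun-1.0-SNAPSHOT-allinone.jar used in the final step.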

Import the project into an IDE and edit the files

In the next step, I import the project into Eclipse and edit App.java and AppTest.java. The code illustrates a simple word counting in Spark, but the important part here is using something like the following (where I launch a new Spark context with a local master):

try (JavaSparkContext context = new JavaSparkContext("local[2]", "Spark fun!")) {
    ...
    context.stop(); // redundant but harmless: closing the context also stops it
}

Build the project and run

In the final step, I first build the project:

mvn clean package

Then I create a test file and run App from the command line (note that here I use the allinone.jar, which is the one with all dependencies included):

java -cp target/spark-fun-1.0-SNAPSHOT-allinone.jar com.simonj.demo.App test.txt

Finally, after a short time the example program spits out something like this:

{took=1, lorem=4, but=1, text=2, is=1, standard=1, been=1, sheets=1, including=1, electronic=1, of=4, not=1, software=1, type=2, survived=1, book=1, only=1, s=1, desktop=1, to=1, passages=1, containing=1, and=3, versions=1, more=1, typesetting=2, essentially=1, recently=1, ipsum=4, a=2, galley=1, aldus=1, 1960s=1, simply=1, when=1, ever=1, dummy=2, with=2, 1500s=1, in=1, publishing=1, like=1, printing=1, five=1, industry=2, letraset=1, pagemaker=1, since=1, was=1, an=1, into=1, the=6, make=1, has=2, it=3, remaining=1, unknown=1, popularised=1, leap=1, unchanged=1, centuries=1, specimen=1, also=1, printer=1, release=1, scrambled=1}

So it works – what a lovely evening and good night folks!

PS. here is the complete project created through these steps.


Ten things I wish I knew before starting on a PhD

Taking a PhD is about defining yourself as an independent researcher, becoming a field expert and evolving as a person. A journey with lots of pitfalls and setbacks, sometimes exciting and pleasant, sometimes exhausting and seemingly impossible, mostly unplanned and completely unknown beforehand, but hopefully bringing you to the right place at the right time (that is, your defense). At least, this is what my PhD was like. Below, I have written down ten things that I learned mostly by trying and failing, and then doing the opposite and nailing it. I hope some of these tips will be useful to the PhD students out there and perhaps make their journeys a little bit more straightforward.

So here it goes:

  1. Taking a PhD is not the same as being a student

    Getting a PhD is not about learning, but about finding things out on your own. Therefore, it does not really help to take any courses, even the mandatory ones (apart from getting them off your requirements list). There is also no curriculum and no final exam. And most importantly, every minute counts, so spend it doing actual research.

  2. Prepare to fail a lot and even to fail completely

    Good conferences have acceptance rates below 20%, which means that your paper is very likely to be rejected. Perhaps with harsh criticism from an arrogant reviewer who thinks he is clever and funny. But hey, that is his job, and a paper does not define you as a person anyway. So be thankful, use the feedback for improvement, resubmit, learn from it and move on.

    There is also a risk that your idea turns out not to be novel, or that your experiments show no advantage over the baseline. This happens a lot, but instead of losing it – learn from it, and try to think of more contributions to convince the reader (that is, the reviewer) that your work is worth knowing.

    Finally, you may fail completely – get bored, do a startup, run out of time, or a hundred other things. This happens a lot, and so it may happen to you. Accept the possibility, but know that focus and dedication will get you to the finish line.

  3. Start publishing and going to conferences early

    It is funny, but for most PhDs the number of papers per year seems to grow exponentially. This means that you can easily double your number of papers by publishing your first paper one year earlier. Well, the catch is that your first paper will most likely be rejected. Nevertheless, in the worst case you will learn something from it, and in the best case you will go somewhere exotic, warm and nice, get some drinks and have some fun.

  4. Find the conferences you want to go to and read the best papers from the past two or three years …, and copy that

    Firstly, I am not talking about plagiarism – that will not get you anywhere good. Nevertheless, most quality papers from the same conference turn out to be strikingly similar in their writing style, structure, terminology, methodology, data sets, etc. Copying that will get you quite far – if it looks like a good paper and sounds like a good paper, it must be a good paper!

    Secondly, this will give you some understanding of what the conference’s areas of interest are, what is hot and what you can try to improve, and of course what makes a paper really good.

    Finally, it is just so much more convenient to read a well-written related-work section instead of digging up all the scraps from the last ten years of the research field.

  5. Success is 30% novelty, 30% presentation and 30% luck

    You do not need a brilliant idea nor solve all the problems at once. However, you must show that your contributions are new and significant in one way or another, submit before the deadline, cross your fingers and start working on the next thing.

    Moreover, novelty is so important that some of the researchers I know have spent a lot of time looking for problems with minimal or no previous work, and then writing a paper focusing on the novelty of the problem instead of the complexity of the solution. The biggest win of this approach is that in the next few years other researchers will use this work as a seemingly naive baseline and thereby generate tons of citations.

  6. Minimize all programming, backup daily, and use version control, unit tests, checkstyle/findbugs/etc.

    Writing the most advanced code ever will not get you anywhere and will only take precious time away from achieving your actual goal. However, it also sucks having to redo all your experiments due to a bug, or not being able to remember when this or that part was added or changed and why.

    The best approach to research is actually this: first, find the conference you want to submit to – this gives you a deadline, a page limit, etc. Then, find a suitable idea and plan the evaluation. The best researchers I know will also write the whole paper, apart from the experimental results, at this point. Only then implement the framework and run the experiments.

    Of course, as mentioned in the point two, your method may not beat the baseline. In that case, find out why it does not work, tackle that issue and add this to your contribution list. Lather, rinse and repeat…

    A version control system such as Git is a great tool for organizing your work (including your writing) and tracking your progress. With good commit messages you will be able to find when any given part was changed and what you did on any given day, in addition to keeping all versions of your work backed up and neatly organized. Unit tests and code analysis tools will allow you to modify any part of your code without worrying about newly introduced bugs, keeping you focused towards the deadline.

    Documenting your code and making it reusable as much as you can will speed up your future work. However, it is also a highly time-consuming task, so you need to find the right balance. Consider also uploading your code to GitHub, letting others learn from and reuse it.

    Finally, log everything you can from every experiment, not just the very minimum you wanted to present, because this will give you a future opportunity to extend your work without having to redo all the experiments.

  7. Collaborate and go abroad early

    I know the feeling: your supervisor is too busy, none of your family members or friends understands your work, all reviewers are against you, it takes a hell of a lot of time to publish a paper and none of your ideas seem to work. So what to do?

    – Collaborate! Talk to your fellows, even if they are working in completely different fields. Go to their presentations, get them into some kind of discussion, even if they look very annoyed and ask you to leave – oddly enough, this may result in a great paper! Find another institution that is doing great work in your field and go there, even without pay. Find mentors, work with them and learn, and ask for more work after you get back!

  8. Become a reviewer

    As mentioned before, publishing a paper is primarily about convincing the reviewers that your work is great. For that you need to understand what makes a paper great, and the best approach for that is obviously reviewing other people’s papers.

    Of course, you can review any paper you find, but being a reviewer at a conference has a nice bonus – access to not-yet-published work, which means you will know which topics researchers out there are working on right now. And even better, you can help them improve their work, which will bring you a great amount of professional satisfaction and self-realization.

  9. Keep calm and carry on

    Your time is limited; most PhD programs run over three years – that is more than a thousand days, even after subtracting vacations and days off. In the beginning this sounds like an infinite amount of time, but suddenly you find that half of it is gone and you haven’t gotten anywhere. You slip into a state where you cannot think about work when you are there, and cannot think about anything but work when doing anything else…

    – Ok, think of it this way: eventually it will work out. If not, there will be something else. One way or another, five years from now you will be in a completely different place and none of this will matter at all. In fact, you will look back and remember only the best parts – the exotic places, banquets and drinks.

    For now, just focus on the small things – writing a good paragraph, finding a clever way to present a result, typing another block of code, running another experiment, enjoying the paper you read, etc. Whatever you do right now – focus only on this, and afterwards leave your work at the office.

  10. Enjoy your spare time, meet new people and friends

    Most PhD students struggle with loneliness. In fact, it is the greatest challenge you will encounter and the exact thing that turns your brain into a stressed monkey. So join a Judo club – some of those guys may become your best friends forever, and getting smashed against a mat for an hour and a half will surely keep you from overestimating the importance of your research troubles. In fact, any other sport will do.

    Go out, meet a friend, have a drink and enjoy being single and young (that is, the loud music and the smell of perfume and sweat). And surely talk to that beautiful girl who looks like Scarlett Johansson – little do you know, a few years later she may be your wife, and you will be sitting there at 4 AM glued to an attention-sick little baby that for some odd reason does not want to sleep, so tired that all the all-nighters and deadlines you pulled through your PhD sound like eight hours of good old sleep. Or maybe not at all – she will just roll her eyes and leave, but take the chance anyway!


I was lucky to attend CIKM this year. This was the first research conference I have been to since I defended my thesis, as well as the first one attended under a non-academic affiliation. Besides giving an invited talk on the things we do at Cxense right now, presenting a paper on work I did at Yahoo! Labs a few years ago at the LSDS-IR workshop, and meeting old colleagues and friends, I was highly interested in attending as many presentations as I could. I was especially interested in the keynotes and industrial presentations, of which there were many. Here are some of my favorites.

For the keynotes, Jaime Teevan held an exciting talk about so-called slow search, that is, using more time to provide a better search experience; Xiaofang Zhou gave an interesting talk about mining massive spatial trajectory datasets and some of the challenges and opportunities there; and Andrew Tomkins held a mind-blowing presentation about how people make choices among several alternatives. Among these, I especially enjoyed Jaime’s talk for all the insightful examples, case studies and jokes.

Among the industry talks, I deeply enjoyed the telecom data-mining talk held by Amy Shi-Nash about all the exciting work they do together with SingTel, Sofus Macskassy’s presentation about machine learning at Facebook, and Wei-Ying Ma’s demonstration of the techniques used to build Xiaoice, the advanced chatbot developed by Microsoft that is becoming extremely popular in China these days.

After the conference, I paid a short visit to Sydney and spent a week working at the Melbourne office of Cxense and bonding with my Australian colleagues, before heading back to my family in Norway. Overall, this was a nice trip, which I enjoyed very much despite a week of ten-hour jet lag and all the hassle.