Jim Driscoll's Blog

Notes on Technology and the Web

Bacon Numbers with Java 8, part 2

leave a comment »

In my previous post, I started working on the following problem:

Given a large hash table whose keys are movie names and whose values 
are a list of actors in those movies, write a function to determine 
the Bacon Number of a particular actor.

So far, we’ve used Stream.generate, working with Random.nextGaussian to create a sample dataset, in the format Set<List<Integer>> – we’re using a Set since we (conveniently) don’t need to pay attention to the Movie Names.

Now that we have the data, we’re going to want to search it. But before we do, it’d be nice to be able to get a feel for the performance of the different algorithms we’re going to write. That means writing something that’s a useful micro benchmark tester. By functional interfaces in JDK 8, and providing lambdas, it’s pretty easy.

First, since we’ll be producing a summary of runtimes, create a function that prints out LongSummaryStatistics:

private void printStats(String name, String append, LongSummaryStatistics stats) {
    System.out.println(name + " Num Runs: " + stats.getCount());
    System.out.println(name + " Sum      " + append + ": " + stats.getSum());
    System.out.println(name + " Average  " + append + ": " + stats.getAverage());
    System.out.println(name + " Min      " + append + ": " + stats.getMin());
    System.out.println(name + " Max      " + append + ": " + stats.getMax());
}

These five method calls print out all the basics. It’d be nice if it also offered stats like the median and the mode, but we could always extend LongSummaryStatistics if we felt ambitious.

With that display out of the way, let’s take a first stab at creating a benchmark method:

private void benchmark(String name, Supplier supplier) throws Exception {
    // first run, warmup
    supplier.get();
    List timings = new ArrayList(NUMRUNS);
    for (int i = 0; i  name + " incremental time: " + (Duration.between(start, end).toMillis()));
    }
    printStats(name, "run time", timings.stream().mapToLong((l) -> (long) l).summaryStatistics());
}

This method takes a Supplier as a parameter. Then, it runs it once (to warm up any underlying JDK structures), and then runs it NUMRUNS times, saving the timings in a List. Running the supplied function is as simple as calling Supplier.get. To get those timings, we’re using the new Date/Time API in JDK 8, calling Instant.now, and getting the difference via Duration.between and Duration.toMillis. Though those calls are probably new to you, I bet you know what they do without explanation (I like the new Date/Time API for that very reason). There’s also a call to Logger.getGlobal in there, so we can examine intermediate results by turning logging on to the INFO level – and it takes a Supplier function as well, so we no longer have to provide a if/then for the logger – sweet!

A really good micro benchmark method would also strip the lowest and highest values, but I was interested in making this easy to understand, so I’ve left it out – but you could do that by taking the Stream of Long values and running a reduce operation on them.

To use this method with the method we’ve already developed to generate a list of Movies, we could just call it like so: benchmark("createMovies", this::createMovies); – this passes in a method reference. Since we’re not passing any arguments to create the movies (the values that we use for tuning are all set via statics), we don’t even need to create a lambda.

So, that covers the case for testing creating the movies. In fact, by using it, we can find out that creating a Gaussian distribution for our dataset takes about twice as long as a flat dataset, and trying to run it in parallel makes it slower (since Stream.generate doesn’t have a corresponding ParallelStream.generate, the parallelization happens in the wrong place, you’d need to use Futures instead).

What about testing the graph transform we’re about to write? We could just do something like this:

final Set<List> movies = createDistMovies();
benchmark("transformGraph constant data", () -> transformGraph(movies));

That will work – but will only execute the transform on the same data, over and over. Suppose you’d like to create a new dataset for each run – this would work:

benchmark("transformGraph constant data", () -> transformGraph(createDistMovies()));

But then you’d be measuring the wrong thing – you’d include the creation of the dataset as part of the measure, and that’s not right. No, instead we should do a something like this:

private <T, S> void benchmark(String name, Supplier<T> source, Function<T, S> sink) throws Exception {
    // first run, warmup
    sink.apply(source.get());
    List timings = new ArrayList(NUMRUNS);
    for (int i = 0; i  name + " incremental time: " + (Duration.between(start, end).toMillis()));
    }
    printStats(name, "run time", timings.stream().mapToLong((l) -> (long) l).summaryStatistics());
}

So, now we’re providing a Supplier to generate our data by calling Supplier.get (returning type T), and providing a Function which takes that data (also of type T), and transforms it by running Function.apply (returning type S). Neat, right? Otherwise, it’s pretty much doing the same thing as the previous version.

To use it, we would say:

benchmark("transformGraph dist data", this::createDistMovies, this::transformGraph);

Let’s expand on that, and think one more step ahead. After we have the transformed data, we’re going to want to run the actual search algorithm on it. Additionally, since we’re going to pick the start and end points at random, we might want to screen out values where the the two aren’t actually connected, or are connected in the first run – that means we’ll want to add a test to check for that. It’s actually pretty easy to do all that:

private <T, U> void benchmark(String name, 
                              Supplier<T> source, 
                              Function<T, U> sink, 
                              Predicate<U> ignoreresultif) throws Exception {
    // first run, warmup
    sink.apply(source.get());
    List<Long> timings = new ArrayList(NUMRUNS);
    List<U> results = new ArrayList(NUMRUNS);
    U result = null;
    Instant start, end;
    for (int i = 0; i < NUMRUNS; i++) {
        do {
            T t = source.get();
            rest();
            start = Instant.now();
            result = sink.apply(t);
            end = Instant.now();
            //log result, if appropriate
            final U res = result;
            Logger.getGlobal().info(() -> name + " incremental result: " + res);
            Instant s = start;
            Instant e = end;
            Logger.getGlobal().info(() -> name + " incremental time: " + 
                                       (Duration.between(s, e).toMillis()));
        } while (ignoreresultif.test(result));
        timings.add(Duration.between(start, end).toMillis());
        results.add(result);
    }
    if (result instanceof Integer) {
        printStats(name, "results", results.stream().mapToLong((num) -> ((Integer) num).intValue()).summaryStatistics());
    }
    printStats(name, "run time", timings.stream().mapToLong((l) -> (long) l).summaryStatistics());
}

So, same concept – we’ve got a Supplier as the source of data, there’s a Function which does something to that data, and returns a result, same as in the previous example. This time, though, we have a test, defined by the Predicate, which will ignore the result if the Predicate.test method returns true. Also, because we’re using the result, start and end values outside of the loop, the JDK doesn’t consider those ‘effectively final’ – so using them in the logging lambdas requires us to save them to another value before the log message (which is actually kind of a drag).

Now you can use this method to do some pretty sophisticated calls for running rough benchmarking on what’s going to be our finished product:

benchmark("breadthfirst variable data", () -> {
        Set<List> m = createDistMovies();
        return transformGraphWithParallelStreams(m);
    }, this::breadthFirst, (result) -> result < 1);

The Supplier is a lambda which creates the sample data, transforms it, and returns it, the next Function is our search function, which will (implicitly) be fed that result (how convenient is that!), and then will have it's result tested with the Predicate lambda – it'll ignore the result if it ended up being too easy of a search.

Next up, we'll transform the data to get it ready to search, including using a parallel stream.

Written by jamesgdriscoll

January 11, 2014 at 12:06 PM

Leave a comment