Jim Driscoll's Blog

Notes on Technology and the Web

Bacon Numbers with Java 8, part 1

leave a comment »

This is the fourth, and in all likelihood last, interview question that I’ll be using to explore JDK 8. It’s fun to look at new tech, but it’ll probably be years before I get to use JDK 8 at my current job. In case you’ve missed them, so far I’ve covered:

For my fourth question, I’m going to cover the following sample interview question: “Given a large hash table whose keys are movie names and whose values are a list of actors in those movies, write a function to determine the Bacon Number of a particular actor.”

If you’re unfamiliar with what a Bacon Number is, the internet is full of pretty good write-ups – I found a nice one with just a few clicks. Suffice it to say that it’s a pretty standard social graph problem.

Now, the question itself is the kind of “incorrectly-structured Graph” question that’s always given me problems during actual interviews – mostly because I pretty much never need to do anything like this at work. One of the nice things about working on generalized framework code is that you tend to have much better control of the inputs than someone working closer to the final problem to be solved. But I think I’ve found a way (for me, at least) to think about these things that makes more intuitive sense – as data structure problems.

As presented, this data structure is pretty classical unnormalized data – the Strings representing the Actors names are repeated everywhere, and there’s no link between the actors in any way that you could easily search.

Also, the question specifically doesn’t ask for the full answer to the “Six Degrees Of Kevin Bacon” party game – the questioner doesn’t care at all about the movie titles. So, one obvious answer would be to “normalize” the data down to a series of integers, each one acting as an ID to the actor, and then build a relationship table. Let’s do that – and after that, we should be able to dispatch this with a fairly simple breadth first function.

First, we can use the new JDK 8 Stream code to fairly easily produce a list of “movies”. We can use that for all our subsequent functional and performance testing.

Random random = new Random(System.currentTimeMillis());
int spread = MAXNUMACTORSPERMOVIE - MINNUMACTORSPERMOVIE + 1;

Set<List<Integer>> movies =  Collections.unmodifiableSet(
    Stream.generate(
        // make a stream of ramdom numbers, arranged evenly between 1-NUMACTORS
        () -> random.ints(1, NUMACTORS + 1)
        .distinct()
        // the stream should be MINNUMACTORSPERMOVIE - MAXNUMACTORSPERMOVIE in length
        .limit(random.nextInt(spread) + MINNUMACTORSPERMOVIE)
        // turn the stream into a List of Integers
        .boxed().collect(Collectors.toList()))
    // only create NUMMOVIES Lists 
    .limit(NUMMOVIES)
    // put those lists in a set
    .collect(Collectors.toSet())
);

If you’ve been reading my previous blogs, there’s not much new here: Create a new Stream, make that Stream consist of another Stream (generated by random.ints) which will be numbers between 1 and NUMACTORS, have no duplicates (distant), and be the correct length (limit). Box that Stream of int into Stream<Integer>, and collect it into a List. That means that the original (outer) Stream is actually a Stream<List> – which in turn we limit to the appropriate amount, and bundle up into a Set.

Creating a list of 300k ‘movies’ takes less than a second, and I regard that as plenty fast. To see the size of the Collection it produces, by the way, you could execute this:

movies.parallelStream().unordered().mapToLong((l) -> (long) l.size()).summaryStatistics()

On the aforementioned set of 300k movies, that takes about 40ms to run, and will let you know that those 300k movies contain 3.8 million “roles”.

You can also print out the data using a Stream, with a StringBuffer:

private void printMovieActors(Set<List<Integer>> set) {
    StringBuilder sb = new StringBuilder("Movies: ");
    set.stream().limit(MAXPRINT).forEach((l) -> {
        sb.append("\nMovie: \n");
        l.stream().forEach((s) -> sb.append(s).append(", "));
    });
    System.out.println(sb.toString());
}

That’s alright for sample data, but we could do a bit better: By using a simple random number generator, we created a flat distribution – there are some actors with a large number of roles, but not many.

One way to deal with that is to use a Gaussian distribution for the random numbers. That’s generated this way:

Random random = new Random(System.currentTimeMillis());
long midpoint = NUMACTORS / 2;
int spread = MAXNUMACTORSPERMOVIE - MINNUMACTORSPERMOVIE + 1;

Set<List<Integer>> movies = Collections.unmodifiableSet(
    Stream.generate(
        // make a stream of random numbers, distributed in a curve, between 1-NUMACTORS
        () -> Stream.generate(() -> Math.round(random.nextGaussian() * ONESIGMA + midpoint))
            .map((l) -> l.intValue())
            // remove any unsuitable values, both dups and out of range
            .filter((i) -> i > 0 && i <= NUMACTORS).distinct()
            // the stream should be MINNUMACTORSPERMOVIE - MAXNUMACTORSPERMOVIE in length
            .limit(random.nextInt(spread) + MINNUMACTORSPERMOVIE)
            // turn the stream into a List of Integers
            .collect(Collectors.toList()))
        // only create NUMMOVIES Lists 
        .limit(NUMMOVIES)
        // put those lists in a set
        .collect(Collectors.toSet())
    );

This is approximately the same code as before – instead of producing a an IntStream via Random.ints, we use a second Stream.generate – this calls a function which gets numbers that are more common in the middle than on the ends – random.nextGaussian returns a number which is going to land between -1 and +1 67% of the time. The rest of the time, it will land further and further away from zero, with decreasing probability.

So, by picking the value of ONESIGMA in the code above, we can control how big a group of “power players” we have in our simulated data set. (Actually, they’re called “super connectors” in social graph theory. Same thing.) The rest of the code consists of a filter to guard against values that are outside our allowable range (the Guassian number generator can generate pretty much any number), and a map to change the returned value from a long to an int (we could have done that in the generated code, but I wanted to show one use for a map transform).

In my sample code, I have a value for ONESIGMA of 10% of the total actors:

private final int ONESIGMA = NUMACTORS / 10; // 10% of actors get 67% of work

But that’s probably overly generous – from what I’ve heard, the number prolific actors is actually far, far smaller, and almost everyone in SAAG has only worked a handful of paying gigs.

Now that we have our data, we should use it! But this is already long, and so I’ll examine using this newly created sample data next time.

Written by jamesgdriscoll

January 9, 2014 at 6:17 PM

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 413 other followers

%d bloggers like this: