Can instanceof make Unmodifiable Collections faster?
In a battle-hardened code base, all the ordinary bugs have been fixed - the ones that remain are usually quite bizarre. This is the story of one such bug and how it uncovered a less-than-optimal implementation in the JDK itself.
The Bug
Do you know those bugs that only show up after the program has been running for a while?
Hibernate's Gunnar Morling recently told the story of one that, after a few days, would lead to a StackOverflowError on calling Set::isEmpty.
He tracked it down to a faulty implementation of a cache for sets:
On each access, the cache would obtain a Set, wrap it in Collections::unmodifiableSet, and then reinsert it, thus piling more and more unmodifiable wrappers on the underlying set.
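The article doesn't show the cache's code, but a hypothetical sketch (all names and types made up) of the kind of logic that produces this pile-up could look like this:

import java.util.*;

// hypothetical sketch of the faulty caching logic (names made up):
// every read re-wraps the cached set and stores the wrapper back,
// so each access stacks another unmodifiable wrapper on top
class SetCache {

    private final Map<String, Set<Integer>> setsByKey = new HashMap<>();

    Set<Integer> get(String key) {
        Set<Integer> set = setsByKey.computeIfAbsent(key, k -> new HashSet<>());
        Set<Integer> wrapped = Collections.unmodifiableSet(set);
        // bug: the wrapper replaces the (possibly already wrapped) cached set
        setsByKey.put(key, wrapped);
        return wrapped;
    }
}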
The thing is, the methods on the wrapper (take isEmpty as an example) can't do anything themselves and have to forward to the underlying collection:
// actual implementation of
// java.util.Collections.UnmodifiableCollection#isEmpty
// (`c` is the wrapped collection)
public boolean isEmpty() {
    return c.isEmpty();
}
Pile up enough wrappers, each forwarding method calls to the next, and you will overflow the stack. It's turtles all the way down.
The Reason
The immediate reason for the bug was the faulty cache implementation, but there's something else here:
Why does Collections::unmodifiableSet wrap an already unmodifiable set?
Couldn't it check whether the specified set is already unmodifiable and return it as-is if that's the case?
From a technical point of view, nothing prevents that, and the method's contract doesn't call for returning a new instance either.
The remaining question is, does it make a difference?
The Benchmark
Surely, forwarding method calls is a waste of time, right?
Or is there some clever optimization that can nullify that effect?
One straightforward way to find out is to benchmark it.
Simply wrap a Set into unmodifiable wrappers and check how long various operations take.
Besides iteration and isEmpty, I also want to benchmark the effect on linear and random access, so I test lists as well as sets.
(You can find the code on GitHub and the results in this Google Spreadsheet. I use JMH - if you want to learn about it, check out the samples.)
Setup
Here are my specs:
- CPU: Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
- RAM: Samsung DDR3 16GB @ 1.60GHz (the tests ran entirely in RAM)
- OS: Gentoo, kernel version 4.14.83-gentoo
- Java: OpenJDK 11.0.1
- JMH: 1.21
To set up the benchmarks, I create collections with size elements and wrap them depth times.
Putting aside efforts to reuse code, this is what it looks like for HashSet:
// create the underlying collection with `size` elements
Set<Integer> underlyingCollection = new Random()
    .ints(size).mapToObj(i -> i)
    .collect(Collectors.toCollection(HashSet::new));

// wrap it `depth` times - benchmarks use `unmodifiableCollection`
Set<Integer> unmodifiableCollection = underlyingCollection;
for (int i = 0; i < depth; i++)
    unmodifiableCollection = Collections.unmodifiableSet(unmodifiableCollection);
return unmodifiableCollection;
For lists it looks very similar, but in addition, I create int[] getAts.
This int array has the same length as the list and contains random list indices.
I will use it to access the wrapped list in an unpredictable pattern.
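The article doesn't show that part of the setup, but a minimal sketch (assuming the underlying list is created analogously to the set above and is called underlyingCollection) could look like this:

// fill `getAts` with one random index into the list per element
int[] getAts = new Random()
    .ints(underlyingCollection.size(), 0, underlyingCollection.size())
    .toArray();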
Benchmark methods
These are the benchmark methods for both lists and sets:
@Benchmark
public boolean isEmpty() {
    return unmodifiableCollection.isEmpty();
}

@Benchmark
public void forEach(Blackhole hole) {
    unmodifiableCollection.forEach(hole::consume);
}

@Benchmark
public void iterator(Blackhole hole) {
    for (int i : unmodifiableCollection)
        hole.consume(i);
}
A few words on the JMH features I use here:
- @Benchmark lets JMH know that I want it to benchmark that method
- returning the result of calling isEmpty lets JMH make sure the JIT compiler doesn't treat the method as dead code (i.e. code that has no effect and is hence redundant), which it would otherwise eliminate (see the counterexample below)
- sending the list's elements into a Blackhole is a more pedestrian approach to the same effect
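For contrast, here's a hypothetical benchmark (not part of the actual suite) that discards the result and would thus be vulnerable to exactly that kind of elimination:

@Benchmark
public void isEmptyDeadCode() {
    // the result is ignored, so the JIT may consider the call dead code
    // and eliminate it, which would make the measurement meaningless
    unmodifiableCollection.isEmpty();
}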
On lists, there are two more things I want to benchmark - linear and random access:
@Benchmark
public void linearAccess(Blackhole hole) {
    List<Integer> list = unmodifiableCollection();
    for (int i = 0; i < list.size(); i++)
        hole.consume(list.get(i));
}

@Benchmark
public void randomAccess(Blackhole hole) {
    List<Integer> list = unmodifiableCollection();
    for (int at : getAts)
        hole.consume(list.get(at));
}
Enough with the build-up - what are the results?!
Measurements
I ran the benchmarks with varying collections (HashSet, LinkedHashSet, ArrayList, LinkedList), collection sizes (10k, 100k, 1m), and wrapping depths (0, 1, 10, 100, 1000; I also dabbled with 10k, but it would often lead to a StackOverflowError).
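The actual benchmark code is on GitHub; as a rough sketch of how such parameters are typically declared with JMH (class and field names here are made up):

import java.util.Set;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class WrappingBenchmark {

    @Param({ "10000", "100000", "1000000" })
    int size;

    @Param({ "0", "1", "10", "100", "1000" })
    int depth;

    Set<Integer> unmodifiableCollection;

    @Setup
    public void createAndWrapCollection() {
        // create the underlying collection and wrap it `depth` times,
        // as shown in the setup snippet above
    }
}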
Keeping in mind that it's the last factor we're really interested in, here are a few observations that reduce the amount of data we have to look at:
- the collection type has no relevant effect; of course different collections have different absolute run times, but the wrapping depth always has a very similar influence
- the collection size has the effect you would expect, independent of wrapping depth:
  - for a constant operation like isEmpty, collection size doesn't matter
  - for iterating operations, each 10x increase in size roughly leads to a 10x increase in run time (a notable exception seems to be HashSet, where the factor seems to be about x15, but let's not dwell on that here)
Now let's look at how run times depend on wrapping depth.
The following data is for an ArrayList with a million elements, but as we've just established, it's representative of other collection types and sizes.
Benchmark | Depth | Score | Unit
---|---|---|---
forEach | 0 | 6499.524 ± 116.113 | μs/op
forEach | 1 | 6466.625 ± 79.003 | μs/op
forEach | 10 | 6572.141 ± 49.209 | μs/op
forEach | 100 | 6429.473 ± 35.505 | μs/op
forEach | 1000 | 6756.348 ± 103.719 | μs/op
isEmpty | 0 | 0.003 ± 0.001 | μs/op
isEmpty | 1 | 0.004 ± 0.001 | μs/op
isEmpty | 10 | 0.015 ± 0.001 | μs/op
isEmpty | 100 | 0.205 ± 0.004 | μs/op
isEmpty | 1000 | 2.420 ± 0.020 | μs/op
iterator | 0 | 5729.565 ± 429.593 | μs/op
iterator | 1 | 7343.901 ± 334.612 | μs/op
iterator | 10 | 35199.093 ± 285.387 | μs/op
iterator | 100 | 417297.905 ± 9552.693 | μs/op
iterator | 1000 | 4778937.048 ± 46199.758 | μs/op
linearAccess | 0 | 3774.128 ± 39.618 | μs/op
linearAccess | 1 | 4714.176 ± 22.733 | μs/op
linearAccess | 10 | 30394.111 ± 218.719 | μs/op
linearAccess | 100 | 417612.502 ± 4279.777 | μs/op
linearAccess | 1000 | 4661914.890 ± 30868.705 | μs/op
randomAccess | 0 | 7231.825 ± 587.417 | μs/op
randomAccess | 1 | 7406.953 ± 692.301 | μs/op
randomAccess | 10 | 26135.629 ± 688.995 | μs/op
randomAccess | 100 | 243328.907 ± 6699.941 | μs/op
randomAccess | 1000 | 2457828.925 ± 17852.538 | μs/op
Let's start with isEmpty.
It has virtually no cost, but forwarding the call down the seemingly endless chain of wrappers does.
You can see how that cost scales roughly linearly with wrapping depth.
Then, take a look at forEach.
At first, it may seem surprising that its run time is so stable.
Then again, consider the implementation in the unmodifiable wrapper:
public void forEach(Consumer<? super E> action) {
    c.forEach(action);
}
As you can see, it doesn't really do anything except forward the given consumer to the next wrapper.
We can see that iterating over a million elements takes about 6,500 μs, and we know from isEmpty that forwarding a call down a thousand wrappers takes about 2 μs.
Given those numbers, the stable performance isn't surprising after all.
Finally, let's have a look at iteration, linear access, and random access.
Unsurprisingly, randomAccess starts off slower than linearAccess because it jumps all over the underlying data structure and hence constantly produces cache misses.
But then random access catches up and eventually overtakes linear access - what the hell?!
Take a look at the methods' code again:
While linearAccess makes two forwarding calls per loop iteration (to size and get), randomAccess makes only one (to get), so it stands to reason that - as unwrapping begins to dominate execution time - the added cost of cache misses becomes less relevant.
Comparing different methods, we can deduce that the first wrapper (i.e. depth 1) has a measurable but not dominant performance impact. Most importantly, though:
There's a clear linear relation between wrapping depth and run time for all iteration methods except forEach.
And while wrapping a collection often enough to overflow the stack is surely the consequence of a bug, wrapping it a few, maybe even a dozen, times can happen easily in an environment where constructors routinely wrap incoming lists.
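As a hypothetical example of how a handful of wrappers can accumulate without any bug, consider a defensive constructor (the class is made up):

import java.util.Collections;
import java.util.List;

class Order {

    private final List<String> items;

    Order(List<String> items) {
        // if `items` is already an unmodifiable wrapper (because the caller
        // did the same thing), this adds another layer on top of it
        this.items = Collections.unmodifiableList(items);
    }

    List<String> items() {
        return items;
    }
}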
Benchmarking Instance Checks
It is easy to keep unmodifiable collections from piling up - simply add an instanceof check to Collections::unmodifiable... that returns the given instance if it already is unmodifiable:
public static <T> Set<T> unmodifiableSetWithInstanceCheck(Set<T> s) {
    if (s instanceof UnmodifiableSet)
        return s;
    return new UnmodifiableSet<>(s);
}
While the change is easy, is it also cheap?
Here are the results of a benchmark that wraps a Set 1, 10, 100, 1k, and 10k times:
Benchmark | Depth | Score | Unit
---|---|---|---
originalUnmodifiable | 1 | 8.508 ± 0.770 | ns/op
originalUnmodifiable | 10 | 50.997 ± 2.696 | ns/op
originalUnmodifiable | 100 | 413.526 ± 74.531 | ns/op
originalUnmodifiable | 1000 | 3405.815 ± 382.755 | ns/op
originalUnmodifiable | 10000 | 32218.869 ± 3729.403 | ns/op
withInstanceOfCheck | 1 | 6.236 ± 0.514 | ns/op
withInstanceOfCheck | 10 | 15.241 ± 0.686 | ns/op
withInstanceOfCheck | 100 | 108.349 ± 1.621 | ns/op
withInstanceOfCheck | 1000 | 964.785 ± 19.586 | ns/op
withInstanceOfCheck | 10000 | 9575.831 ± 197.473 | ns/op
Wait, what? How is the instance check faster?! Here's my theory:
- while we add an instanceof check, we avoid the creation of unnecessary UnmodifiableSet instances (memory allocation is dirt cheap, but it's not free)
- we greatly reduce pressure on the garbage collector because we create way fewer instances
The problem with that theory is that it doesn't explain why withInstanceOfCheck is faster even for depth 1.
Investigating that goes beyond the scope of this article, though, and so I'll leave that as an exercise for the reader.
Or a future post.
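One way to probe the allocation part of that theory would be to rerun the wrapping benchmarks with JMH's GC profiler attached - a minimal sketch (the included benchmark name is made up):

import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunWithGcProfiler {

    public static void main(String[] args) throws Exception {
        Options options = new OptionsBuilder()
            // hypothetical name of the wrapping benchmark class
            .include("UnmodifiableWrapping")
            // reports allocation rate and GC counts per benchmark
            .addProfiler(GCProfiler.class)
            .build();
        new Runner(options).run();
    }
}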
The Conclusion
As we have seen, wrapping a collection with Collections.unmodifiable... once has next to no effect, but doing it a few times on performance-critical collections can noticeably harm performance.
Since the technical and performance considerations as well as the methods' contracts allow an implementation that doesn't wrap already unmodifiable collections again, I think that change should be made.
NB: The copyOf methods on List, Set, and Map (introduced in Java 10) already behave this way, so they do not suffer from the same drawback.
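A minimal sketch of that behavior (the printed result reflects current OpenJDK builds - it's an implementation detail, not a guarantee of the specification):

import java.util.List;

public class CopyOfDemo {

    public static void main(String[] args) {
        // List.of already returns an unmodifiable list
        List<Integer> original = List.of(1, 2, 3);
        // copyOf recognizes that and does not wrap or copy again
        List<Integer> copy = List.copyOf(original);
        System.out.println(copy == original); // prints true on OpenJDK 11
    }
}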