The JVM's mysterious AllocatePrefetch options: what do they actually do?

Wed 29 January 2020 Sadiq Jaffer & Richard Warburton

The HotSpot JVM comes with a range of non-standard -XX: options, many of which have an impact on performance. One such set is the family of so-called AllocatePrefetch options, comprising -XX:AllocatePrefetchStyle, -XX:AllocatePrefetchStepSize, -XX:AllocatePrefetchLines, -XX:AllocatePrefetchInstr, -XX:AllocatePrefetchDistance and -XX:AllocateInstancePrefetchLines. In this blog post you’ll learn the background behind why AllocatePrefetch is necessary and how it can help performance.

Allocation

To begin to understand how these options work, let’s break AllocatePrefetch apart and start with how allocation works in the JVM. All Java Garbage Collectors make use of a technique called bump allocation whereby allocation is carried out by adding the required allocation size to the current allocation pointer (the bumping bit) and checking whether this exceeds the limit for the area the allocation pointer is pointing to. Bump allocation comes with many benefits.

Firstly, if enough registers are available on the architecture then the allocation pointer can be permanently kept in one. The conditional that checks whether there is sufficient space for the allocation is also going to be correctly predicted by the CPU nearly all the time, which means that when there is space available (the fast path) allocation is very cheap. This is in contrast to an allocator (GC or a malloc implementation) that uses lists of free blocks in memory, where an allocation might involve walking some parts of a linked list. Secondly, the allocation sequence of instructions is very short, which makes it amenable to inlining into allocation sites.
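The fast path described above can be modelled in a few lines of Java. This is only an illustrative sketch: HotSpot emits the equivalent logic as machine code, and the class name `BumpAllocator` and its fields are hypothetical, not JVM APIs.

```java
// Illustrative model of bump allocation; not how HotSpot is implemented.
final class BumpAllocator {
    private long top;        // current allocation pointer
    private final long end;  // limit of the area we may allocate from

    BumpAllocator(long start, long sizeInBytes) {
        this.top = start;
        this.end = start + sizeInBytes;
    }

    /** Returns the address of the new block, or -1 when the slow path is needed. */
    long allocate(long sizeInBytes) {
        long newTop = top + sizeInBytes; // the "bump"
        if (newTop > end) {              // rarely taken, so well predicted
            return -1;                   // slow path: more space must be requested
        }
        long result = top;
        top = newTop;
        return result;
    }

    public static void main(String[] args) {
        BumpAllocator area = new BumpAllocator(0x1000, 128);
        System.out.println(area.allocate(64)); // 4096
        System.out.println(area.allocate(64)); // 4160
        System.out.println(area.allocate(64)); // -1, the area is full
    }
}
```

The fast path is one addition, one comparison and one store, which is why it is cheap enough to inline at every allocation site.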

TLABs

One complexity comes when running in an environment where threads might want to allocate in parallel. Simply having all threads compete to bump the pointer using atomic instructions like Fetch-And-Add or Compare-And-Swap would be very inefficient. Such atomic operations are significantly slower than simple memory stores, which benefit from a number of latency-hiding optimisations by the processor that can’t be applied to atomic instructions. Additionally, many threads competing to write to a single memory location at high frequency will generate a significant amount of traffic between CPU caches. The solution to this problem used by Java Garbage Collectors is the Thread Local Allocation Buffer (TLAB).

A TLAB is a buffer capable of servicing many allocations and is owned by a thread. Threads request these buffers from the Garbage Collector’s allocator and then use them to allocate new objects into. Since the buffers are local to the thread and allocation within them is single threaded, the fast bump allocation method can be used. The JVM dynamically chooses the size of TLABs, giving smaller TLABs to threads doing little allocation and larger ones to threads that allocate more frequently. Aleksey Shipilёv has a good set of posts on TLABs that are worth reading.
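As a rough sketch of the idea, the toy model below hands out fixed-size buffers from a shared pointer with one atomic operation per buffer, then bump allocates inside each thread's buffer without any atomics. The class name `TlabSketch`, the fixed `TLAB_SIZE` and the `-1` failure value are assumptions made for illustration. As noted above, the real JVM sizes TLABs dynamically.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model of TLAB allocation; names and the fixed TLAB size are hypothetical.
final class TlabSketch {
    static final long TLAB_SIZE = 1024; // assumption: real TLABs are sized dynamically

    private final AtomicLong sharedTop = new AtomicLong(0); // shared, contended pointer
    private final long heapEnd;
    // Per-thread {top, end} pair; starts empty so the first allocation refills.
    private final ThreadLocal<long[]> tlab = ThreadLocal.withInitial(() -> new long[2]);

    TlabSketch(long heapSize) {
        this.heapEnd = heapSize;
    }

    /** Returns the address of the new block, or -1 if the shared area is exhausted. */
    long allocate(long size) {
        if (size <= 0 || size > TLAB_SIZE) {
            throw new IllegalArgumentException("sketch only handles sizes in (0, TLAB_SIZE]");
        }
        long[] t = tlab.get();
        if (t[0] + size > t[1]) {                        // slow path: TLAB is full
            long start = sharedTop.getAndAdd(TLAB_SIZE); // one atomic op per TLAB, not per object
            if (start + TLAB_SIZE > heapEnd) {
                return -1;
            }
            t[0] = start;
            t[1] = start + TLAB_SIZE;
        }
        long result = t[0];                              // fast path: plain bump, no atomics
        t[0] += size;
        return result;
    }

    /** Number of times any thread has asked the shared allocator for a buffer. */
    long tlabRequests() {
        return sharedTop.get() / TLAB_SIZE;
    }

    public static void main(String[] args) {
        TlabSketch heap = new TlabSketch(4096);
        for (int i = 0; i < 3; i++) {
            System.out.println(heap.allocate(512));
        }
        System.out.println("TLAB requests: " + heap.tlabRequests());
    }
}
```

Three 512-byte allocations on one thread need only two trips to the shared pointer, and the ratio improves as objects get smaller relative to the buffer.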

Prefetching

Now you understand how allocation works, where does prefetching come in? To understand prefetching you first need to know a little bit about how CPU caches are organised. Modern CPUs have a hierarchy of caches between the core and main memory, moving from small and fast to big and slow. For example, the Ryzen 9 3900X has a first level data cache (L1) of 32 KB per core, a second level cache (L2) of 512 KB per core and a third level cache (L3) of 64 MB shared between cores. Caches are organised into cache lines, which are aligned blocks of memory, usually 64 bytes on modern x86 processors.
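To make the cache line arithmetic concrete, here is a small sketch that maps addresses onto 64-byte lines. The class and method names are hypothetical helpers, and the 64-byte line size is the typical x86 value mentioned above rather than something queried from the hardware.

```java
// Hypothetical helpers for reasoning about 64-byte cache lines.
final class CacheLines {
    static final int LINE_SIZE = 64; // typical on modern x86

    /** Index of the cache line that contains the given address. */
    static long lineIndex(long address) {
        return address / LINE_SIZE;
    }

    /** Address of the first byte of the line that contains the given address. */
    static long lineStart(long address) {
        return address & ~(long) (LINE_SIZE - 1);
    }

    /** Number of cache lines touched by a block of the given size. */
    static long linesTouched(long address, long size) {
        return lineIndex(address + size - 1) - lineIndex(address) + 1;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(lineStart(0x1234))); // 1200
        System.out.println(linesTouched(0, 64));   // 1: exactly one aligned line
        System.out.println(linesTouched(32, 64));  // 2: straddles a line boundary
        System.out.println(linesTouched(0, 144));  // 3
    }
}
```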

To maintain a consistent view of memory, all of the caches sit on a bus and obey a cache coherency protocol. This protocol enables caches to share data, avoiding a trip to main memory if the data is present in another core’s cache. It also takes care of invalidating stale copies when a core updates a value in memory. For a deeper dive, Martin Thompson and Fabian Giesen have good posts on cache coherence protocols.

Allocation without prefetching

With what we know about allocation and caches, let’s walk through what happens during an allocation without prefetching:

1. Bump the allocation pointer by the required allocation size, remembering to check that we’re not over the size of our TLAB
2. Write the object’s header (a mark word and a pointer to the object’s class)
3. Initialize the object’s fields or zero them

A few things happen in the background at step 2.

The first is that the recently allocated block of memory is very unlikely to be present in any caches. This means that the first time it is written to, the CPU will likely need to do a load from main memory. This may seem unnecessary, but the object’s header is only 12 or 16 bytes and a cache line is 64 bytes, so a load from memory is required to populate the rest of the line. This won’t block the write though, as stores go into a store buffer to be applied at a later point in time. On x86 processors from the last decade or so, most loads can actually be served straight out of the store buffer if they are for recently written data. Older generations of processors could do this too, but with restrictions around the size and alignment of loads and stores. Store buffers are of limited size, however: once they are full, subsequent stores can cause the processor to stall.

One big problem with reading data into the caches before writing is cache pollution: you might end up with data sitting in the second or third level caches that doesn’t really need to be there.

The second thing that happens behind the scenes at step 2 is the current core’s cache signals to other caches that it intends to write to a particular cache line and that they should invalidate their entries for that cache line if they exist.

On step 3, object fields are either initialized by the object's constructor or zeroed which results in a series of stores and this can often be a major bottleneck in allocation itself. Aleksey Shipilёv has a good article on initialization costs in JVM allocation for more depth.

Allocation with prefetching

Prefetching involves using one of a family of instructions that hints to the processor that a particular block of data (usually a cache line) will be operated on in a particular way shortly. You can, for example, ask the processor to prefetch a particular cache line in anticipation of a write to that line and to minimise cache pollution in doing so. The x86 prefetch instructions are PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHW and PREFETCHNTA.

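The PREFETCHTx instructions hint to the processor that a particular cache line will be needed for reading soon, with x selecting how close to the core it should be placed. PREFETCHT0 asks for the line in every level of the cache hierarchy, PREFETCHT1 in the second level and beyond, and PREFETCHT2 in the third level and beyond, although the exact placement is implementation-specific.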

PREFETCHW hints to the processor that a particular cache line will be needed for writing soon, so it needs to be brought close to the processor. While implementation-specific, this may avoid some cache pollution in the higher caches.

PREFETCHNTA is a non-temporal prefetch which hints to the processor to fetch a cache line in a way that minimises cache pollution.

By issuing a prefetch instruction ahead of needing to write to a particular cache line you can ensure it has already been fetched, that the nearest cache has taken ownership and potentially minimise the amount of cache pollution.

So how does prefetching work with -XX:AllocatePrefetch? We now add a new step to the previous sequence of operations around an allocation:

1. Bump the allocation pointer by the required allocation size, remembering to check that we’re not over the size of our TLAB
2. Prefetch ahead of the allocation pointer some distance
3. Write the object’s header (a mark word and a pointer to the object’s class)
4. Initialize the object’s fields or zero them

In step 2 we issue one or more prefetch instructions at some distance ahead of the TLAB allocation pointer. Crucially, this means we’re not prefetching the current allocation (we’re about to write to it a few instructions later, so the prefetch would not be useful) but some allocation in the future.

With that context, here’s what the various arguments actually do.

-XX:AllocatePrefetchStyle

This controls whether allocation prefetching is enabled or not, and if so what type of prefetching is involved. The possible values are:

0 - disables prefetching
1 - prefetches after each allocation (the default, and what you saw as step 2 earlier)
2 - checks after each allocation whether a watermark has been hit in the TLAB and if so, issues a prefetch before resetting the watermark
3 - prefetches after each allocation but cache-aligns the prefetched addresses. The official documentation claims this is only applicable to SPARC, but there doesn’t seem to be anything restricting it as such

-XX:AllocatePrefetchDistance

This controls the distance ahead of the allocation pointer that is prefetched. The default value of -1 means the distance depends on the family of processor (the values come from here: https://github.com/openjdk/jdk/blob/dce5f5dbc804e6ebf29ced5a917d5197fa43f551/src/hotspot/cpu/x86/vm_version_x86.hpp#L909). On modern x86 cores from the last decade this ends up being 192 bytes. While the family of processor has an effect on the prefetching distance, the allocation rate has a much bigger one. A thread that does little to no allocation may end up prefetching and then evicting cache lines before they’re actually allocated in the TLAB. This situation results in wasted memory bandwidth, extra cache pollution and, with PREFETCHNTA, a potential extra trip to main memory due to the line no longer being present in the second and third level caches.

-XX:AllocatePrefetchLines, -XX:AllocateInstancePrefetchLines and -XX:AllocatePrefetchStepSize

AllocateInstancePrefetchLines and AllocatePrefetchLines are the number of cache lines to prefetch for instances and arrays respectively. AllocatePrefetchStepSize is the size to step for each line. The following pseudo code will show how this works:

generatePrefetchInstructions( Pointer allocationPointer, Boolean isArray ) {
    // Start prefetching some distance ahead of the allocation pointer
    Pointer current = allocationPointer + AllocatePrefetchDistance;

    // Arrays and instances use different line counts
    Int lines = isArray ? AllocatePrefetchLines : AllocateInstancePrefetchLines;

    for( i = 0 ; i < lines ; i++ ) {
        emitPrefetchInstruction( current );
        current += AllocatePrefetchStepSize;
    }
}

Interestingly, the documented default for AllocatePrefetchStepSize is incorrect, on x86 at least: it actually defaults to the L1 cache line size.
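For a concrete feel of the addresses involved, here is a runnable Java rendering of that pseudo code. It is a sketch only: the method and class names are hypothetical, and the values passed in `main` (a distance of 192 bytes, a 64-byte step and three lines) are example inputs rather than a statement of your JVM's defaults.

```java
import java.util.ArrayList;
import java.util.List;

// Computes which addresses would be prefetched for one allocation (illustrative only).
final class PrefetchAddresses {
    static List<Long> compute(long allocationPointer, int distance, int lines, int stepSize) {
        List<Long> addresses = new ArrayList<>();
        long current = allocationPointer + distance; // start ahead of the allocation pointer
        for (int i = 0; i < lines; i++) {
            addresses.add(current);                  // one prefetch instruction per line
            current += stepSize;
        }
        return addresses;
    }

    public static void main(String[] args) {
        // Allocation pointer at 0x1000 (4096), example settings as described above.
        System.out.println(compute(0x1000, 192, 3, 64)); // [4288, 4352, 4416]
    }
}
```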

-XX:AllocatePrefetchInstr

This selects the prefetch instruction to generate. On x86 the supported values are:

0 - PREFETCHNTA
1 - PREFETCHT0
2 - PREFETCHT2
3 - PREFETCHW

Seeing AllocatePrefetch in action

To see AllocatePrefetch working, we’ve written a few small benchmarks that test with different sized objects so we can see the effect of having AllocatePrefetch on and off.

We use the JMH harness for running the performance tests and for calculating the size of objects we use JOL. Both of these tools are excellent and a must for JVM microbenchmarks.

The test objects we allocate are:

class CacheSizedObj {
    public int p00, p01, p02, p03, p04, p05, p06, p07, p08, p09, p10, p11;
}

CacheSizedObj is designed to result in an object 64 bytes in size (12-byte header + twelve 4-byte ints + 4 bytes of padding).

Then there’s a SmallObj:

class SmallObj {
    public int p01, p02, p03, p04, p05, p06, p07, p08, p09;
}

This is 48 bytes in size (12-byte header + nine 4-byte ints).

Finally there’s LargeObj:

class LargeObj {
    public int p01, p02, p03, p04, p05, p06, p07, p08;
    public int p11, p12, p13, p14, p15, p16, p17, p18;
    public int p21, p22, p23, p24, p25, p26, p27, p28;
    public int p31, p32, p33, p34, p35, p36, p37, p38, p40;
}

This is 144 bytes in size (12-byte header + thirty-three 4-byte ints).
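The sizes above were measured with JOL, but the arithmetic is simple enough to check by hand. The sketch below reproduces it under two assumptions: the 12-byte header quoted above, and 8-byte object alignment (HotSpot's default). The class and method names are hypothetical.

```java
// Back-of-the-envelope object size arithmetic; use JOL for real measurements.
final class ObjectSizeEstimate {
    static final int HEADER_BYTES = 12; // from the text: mark word plus class pointer
    static final int ALIGNMENT = 8;     // assumption: default object alignment
    static final int CACHE_LINE = 64;

    /** Size of an object whose only fields are the given number of ints. */
    static int sizeOfIntOnlyObject(int intFields) {
        int raw = HEADER_BYTES + 4 * intFields;
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT; // round up to the alignment
    }

    /** Fewest cache lines an object of this size can occupy. */
    static int minCacheLines(int sizeInBytes) {
        return (sizeInBytes + CACHE_LINE - 1) / CACHE_LINE;
    }

    public static void main(String[] args) {
        int[] fieldCounts = {12, 9, 33}; // CacheSizedObj, SmallObj, LargeObj
        for (int fields : fieldCounts) {
            int size = sizeOfIntOnlyObject(fields);
            System.out.println(fields + " ints -> " + size + " bytes, at least "
                + minCacheLines(size) + " cache line(s)");
        }
    }
}
```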

Next we have the benchmarks:

@State(Scope.Benchmark)
public class AllocationBenchmark {
    @Benchmark
    @Fork(jvmArgsAppend = {"-XX:AllocatePrefetchStyle=0"})
    public CacheSizedObj allocateCacheAlignZero() {
        return new CacheSizedObj();
    }

    @Benchmark
    @Fork(jvmArgsAppend = {"-XX:AllocatePrefetchStyle=0"})
    public LargeObj allocateLargeObjZero() {
        return new LargeObj();
    }

    @Benchmark
    @Fork(jvmArgsAppend = {"-XX:AllocatePrefetchStyle=0"})
    public SmallObj allocateSmallObjZero() {
        return new SmallObj();
    }

    @Benchmark
    @Fork(jvmArgsAppend = {"-XX:AllocatePrefetchStyle=1"})
    public CacheSizedObj allocateCacheAlignOne() {
        return new CacheSizedObj();
    }

    @Benchmark
    @Fork(jvmArgsAppend = {"-XX:AllocatePrefetchStyle=1"})
    public LargeObj allocateLargeObjOne() {
        return new LargeObj();
    }

    @Benchmark
    @Fork(jvmArgsAppend = {"-XX:AllocatePrefetchStyle=1"})
    public SmallObj allocateSmallObjOne() {
        return new SmallObj();
    }
}

We do several runs of JMH to gather timings, assembly and performance counter data. Just focussing on CacheSizedObj’s two tests, here’s what the assembly looks like when -XX:AllocatePrefetchStyle is set to 0 and so allocation prefetching is disabled:

// Simplifying and focusing on the allocation code itself
mov  0x118(%r15),%rax // get the TLAB allocation pointer
mov     %rax,%r10   // copy it to the R10 register
add     $0x40,%r10  // now add the size of the allocation (64 bytes for CacheSizedObj)
cmp     0x128(%r15),%r10 // check if it's within the TLAB's capacity
jb      initialize_allocated // if it is, go initialize it
… deal with allocating a new TLAB here …
initialize_allocated:
mov     %r10,0x118(%r15) // set the updated TLAB allocation pointer
mov     0x8(%rsp),%r10
mov     0xb8(%r10),%r10
mov     %r10,(%rax) // write the mark word
movl    $0x170357,0x8(%rax) // write the klass pointer
movl    $0x0,0xc(%rax)  // zero the first 4-byte field
movq    $0x0,0x10(%rax) // rinse and repeat, 8 bytes at a time..
movq    $0x0,0x18(%rax)
movq    $0x0,0x20(%rax)
movq    $0x0,0x28(%rax)
movq    $0x0,0x30(%rax)
movq    $0x0,0x38(%rax) // the final field plus the 4 bytes of padding

Now if we set -XX:AllocatePrefetchStyle to 1 we get:

// Simplifying and focusing on the allocation code itself
mov  0x118(%r15),%rax // get the TLAB allocation pointer
mov     %rax,%r10   // copy it to the R10 register
add     $0x40,%r10  // now add the size of the allocation (64 bytes for CacheSizedObj)
cmp     0x128(%r15),%r10 // check if it's within the TLAB's capacity
jb      initialize_allocated // if it is, go initialize it
… deal with allocating a new TLAB here …
initialize_allocated:
mov     %r10,0x118(%r15) // set the updated TLAB allocation pointer
prefetchnta 0xc0(%r10) // Our prefetch instruction
mov     0x8(%rsp),%r10
mov     0xb8(%r10),%r10
mov     %r10,(%rax) // write the mark word
movl    $0x170357,0x8(%rax) // write the klass pointer
movl    $0x0,0xc(%rax)  // zero the first 4-byte field
movq    $0x0,0x10(%rax) // rinse and repeat, 8 bytes at a time..
movq    $0x0,0x18(%rax)
movq    $0x0,0x20(%rax)
movq    $0x0,0x28(%rax)
movq    $0x0,0x30(%rax)
movq    $0x0,0x38(%rax) // the final field plus the 4 bytes of padding

The JVM emits a PREFETCHNTA instruction that prefetches 0xC0 (192 in decimal) bytes ahead of the updated allocation pointer.

So what effect does this have on performance? Below is a table showing the allocations per second for each of the six combinations on an i7 4770:

| Object Type | -XX:AllocatePrefetchStyle | Allocations per second |
| --- | --- | --- |
| `CacheSizedObj` | 0 | 107,771,103 ± 267,004 |
| `CacheSizedObj` | 1 | 132,073,910 ± 317,658 |
| `SmallObj` | 0 | 123,750,800 ± 4,053,955 |
| `SmallObj` | 1 | 153,603,237 ± 4,110,906 |
| `LargeObj` | 0 | 57,323,333 ± 1,919,061 |
| `LargeObj` | 1 | 57,823,700 ± 125,555 |

We can see that the gain for CacheSizedObj is roughly 23% (from about 107.8 million to about 132.1 million allocations per second). That’s a massive increase in throughput. The gain on SmallObj is of a similar order (about 24%) but LargeObj barely changes. Why? It comes back to how CPU caches are organised. With a 64 byte cache line and a 144 byte LargeObj (which spans more than two cache lines) we will end up writing to addresses which have not been prefetched. We could remedy this situation by increasing the number of lines we prefetch with -XX:AllocateInstancePrefetchLines, but care needs to be taken here: if a particular application also includes smaller allocations, we may end up prefetching unnecessarily, wasting memory bandwidth and polluting the caches.
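A toy model makes the coverage argument easier to see. The sketch below walks an allocation pointer through a buffer, records which 64-byte lines a style-1 prefetch would touch, and reports what fraction of the written lines had been prefetched. It is a deliberate simplification with hypothetical names: it assumes the buffer starts at a line-aligned address 0, ignores TLAB refills and hardware prefetchers, and skips the first few lines that no prefetch can reach.

```java
import java.util.BitSet;

// Toy model: which cache lines get prefetched before they are allocated into?
final class PrefetchCoverage {
    static final int LINE = 64;

    static double coverage(int objectSize, int distance, int lines, int step, int allocations) {
        BitSet prefetched = new BitSet();
        long top = 0;                                 // allocation pointer, buffer starts at 0
        for (int n = 0; n < allocations; n++) {
            top += objectSize;                        // bump
            long address = top + distance;            // prefetch ahead of the new pointer
            for (int i = 0; i < lines; i++) {
                prefetched.set((int) (address / LINE));
                address += step;
            }
        }
        int firstLine = (objectSize + distance) / LINE; // earlier lines are never prefetched
        int lastLine = (int) (top / LINE);              // exclusive: lines fully allocated
        int covered = 0;
        for (int line = firstLine; line < lastLine; line++) {
            if (prefetched.get(line)) {
                covered++;
            }
        }
        return covered / (double) (lastLine - firstLine);
    }

    public static void main(String[] args) {
        int[] sizes = {64, 48, 144}; // CacheSizedObj, SmallObj, LargeObj
        for (int size : sizes) {
            System.out.println(size + " bytes, 1 line:  " + coverage(size, 192, 1, 64, 10_000));
        }
        System.out.println("144 bytes, 3 lines: " + coverage(144, 192, 3, 64, 10_000));
    }
}
```

In this model the 64-byte and 48-byte objects have every line prefetched ahead of time, while the 144-byte object gets under half of its lines with a single prefetched line per allocation. Prefetching three lines per allocation restores full coverage for the 144-byte case.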

Why does it work?

Allocation prefetching gives a big throughput increase in the microbenchmark cases where the object we’re allocating is within the cache lines we’re prefetching. The question now is why? The culprit actually ends up being the CPU’s store buffer. As we mentioned earlier, writes to memory go into the store buffer until they can be successfully written to the caches or memory. The buffers are limited in size though, and when they are full further stores to memory will cause the CPU to stall.

If we focus on the CacheSizedObj case and look at some of the performance counters (gathered via JMH using -prof perfnorm) we can see the effect allocation prefetching is having:

| -XX:AllocatePrefetchStyle | Cycles (normalised) | Resource_stalls.sb (normalised) |
| --- | --- | --- |
| 0 | 35.615 | 22.253 |
| 1 | 29.744 | 16.203 |

In the case without allocation prefetching, the microbenchmark iteration takes ~36 cycles on average but ~22 of those cycles are actually spent stalled waiting on store buffer capacity. With allocation prefetching turned on, the cycles stalled on the store buffer fall from ~22 to ~16 per iteration, a reduction of about 27%. This could potentially be because the cache line is already present in the cache and thus the store buffer can drain quicker.

Putting the results in context

It’s worth interpreting these benchmarks and their results. For starters, these are microbenchmarks - they are testing one specific operation and it’s highly unlikely you will see a performance impact of the magnitude these tests show. To put some of the allocation numbers in context, the CacheSizedObj microbenchmark with allocation prefetching enabled is allocating 8.5 gigabytes a second and essentially nothing else. That it is bottlenecked by the store buffer is unsurprising given that stores are all it’s doing. In a bigger application there’s a good chance that some operations will actually be performed on the allocated objects, giving the store buffer more time to drain.
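The figures quoted here follow directly from the tables above. As a quick check, the sketch below multiplies the measured allocation rate by the 64-byte object size and computes the share of cycles spent stalled on the store buffer. The class and method names are hypothetical, and the inputs are simply the numbers reported earlier.

```java
// Sanity-check arithmetic on the benchmark numbers reported above.
final class AllocationRate {
    static long bytesPerSecond(long allocationsPerSecond, int objectSizeBytes) {
        return allocationsPerSecond * objectSizeBytes;
    }

    static double stalledFraction(double stalledCycles, double totalCycles) {
        return stalledCycles / totalCycles;
    }

    public static void main(String[] args) {
        // CacheSizedObj with prefetching enabled: 132,073,910 allocations/s of 64 bytes each.
        long bytes = bytesPerSecond(132_073_910L, 64);
        System.out.println(bytes / 1e9 + " GB/s");

        // Share of cycles stalled on the store buffer, without and with prefetching.
        System.out.println(stalledFraction(22.253, 35.615));
        System.out.println(stalledFraction(16.203, 29.744));
    }
}
```

Even with prefetching enabled, more than half of the cycles in this benchmark are still spent waiting on the store buffer, which underlines how artificial a pure allocation loop is.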

Recap

In this post you’ve learnt how several parts of the JVM’s Garbage Collection operate: how bump pointer allocation works and how TLABs help avoid contention between threads. On the CPU side you’ve learnt how prefetching works, the various prefetch instructions on x86, and how CPU caches are organised into a hierarchy. You then saw how these combine in the JVM’s AllocatePrefetch, whereby it emits prefetch instructions ahead of the allocation pointer. Finally, you saw the effect of allocation prefetching on simple allocation microbenchmarks, where in the ideal case of allocating objects that exactly fit in a cache line there was a roughly 23% increase in allocation throughput. Here prefetching ensured the newly allocated memory was in one of the caches and allowed the store buffer to drain much quicker. This avoided a complete stall while waiting for main memory access.

Using and tuning AllocatePrefetch

Now you know how allocation prefetching in the JVM works, how can you use that knowledge?

To begin with you need to understand where the real bottlenecks in your application are. These are very difficult to evaluate outside of a production environment. Opsian’s Continuous Profiling service was a great way of identifying them. If a bottleneck involves significant allocation in very tight loops, it may be a candidate for tuning. Next, identify the kind of object allocations that are performed: their sizes and relative frequencies. Choose allocation prefetch settings that ensure the most frequently allocated objects are completely prefetched. Finally, test your new build against your existing one in a production environment. Use metrics to identify if there’s been an improvement in performance and something like Continuous Profiling to see if the bottlenecks have shifted.