In Part 2 of Opsian's interview with Aleksey Shipilëv we discuss his work on Shenandoah, the concurrent Garbage Collector for Java, and the reasons why we're seeing a new wave of state-of-the-art concurrent Garbage Collectors.
Richard: So let's move on to the other half of your work at Red Hat which involves Shenandoah. Perhaps you could just do a brief introduction of what Shenandoah is for readers out there?
Aleksey: Sure. Shenandoah is a project that aims to create a low-latency Garbage Collector for Java. It has a long development history; it started well before I joined Red Hat and the project. It is a regionalised collector like G1, and it is fully concurrent, meaning that all of its phases are concurrent. G1 has had concurrent marking for a while, but Shenandoah also does concurrent compaction, and then concurrent reference updates, which solve the rest of the pause problem for the application.
You can go without those other concurrent phases, like CMS does, for instance. CMS does concurrent mark and sweep, but then you get memory fragmentation at some point, and you will have to face the music and dive into a stop-the-world compaction of the heap to resolve those fragmentation issues.
We've been doing Shenandoah for a while, and that has left us with lots of JDK flavors to maintain. Since we are in this business to help customers, we are not only doing the current JDK version of Shenandoah (which is JDK 13 at this point). We upstreamed it in JDK 12 and have maintained it in the mainline since then, and we also ship backports to 8u and 11u, because this is where our users are.
Aleksey: That turns out to be a real problem for users. Technically, we could say that if you want a low-latency collector, you have to update to the newest JDK, but this argument usually falls on deaf ears because updating the JDK, even after 9, is tedious.
Richard: So, what would you say the scenarios are where Shenandoah is best? What would you say were the scenarios where people should try and use Shenandoah versus scenarios where maybe it’s weaker?
Aleksey: One important thing to realize is that OpenJDK collectors are actually quite nice. In many cases, you can get away with fully stop-the-world collectors like Parallel or Serial if your application allows it. If the pauses from Parallel do not fit your requirements, I would say you can then go to G1, and starting with JDK 9 you get it by default. If you're still not meeting your pause-time targets and GC pauses remain a significant contributor to the latency that breaks your performance goals, then you will have to consider a fully concurrent GC. This is where Shenandoah comes in.
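The progression Aleksey describes (Serial/Parallel, then G1, then a fully concurrent collector) maps onto standard HotSpot launcher flags. A sketch, with `app.jar` as a placeholder application:

```shell
# Each step trades more GC implementation complexity for shorter pauses.
java -XX:+UseSerialGC      -jar app.jar   # stop-the-world, single-threaded
java -XX:+UseParallelGC    -jar app.jar   # stop-the-world, multi-threaded
java -XX:+UseG1GC          -jar app.jar   # concurrent marking; the default since JDK 9
java -XX:+UseShenandoahGC  -jar app.jar   # fully concurrent, where the build includes it
```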
It routinely runs with pauses of around one millisecond. For some misbehaving applications, or applications that do lots of weird stuff with finalizers and the like, you can see larger pauses, but normal, regular Java applications usually enjoy very, very low pause times. Now, this is not specific to Shenandoah; it is a property of concurrent collectors in general.
If you have a system with lots of parallelism, meaning you have lots of free CPUs, and you can just say, "I have 16 CPUs, and I can dedicate 2 of those CPUs to doing GC work", then that might be a better tradeoff for you than accepting a stop-the-world pause across your 8 application threads.
These are some of the surprising things our users mention in their testimonials. They say, "Well, we thought there would be a throughput decrease with a concurrent collector because you have to do more barrier work. What really happens is that the GC work gets offloaded to the free CPUs, so our workload's wall-clock time was actually lower because of that".
You also might think that pauses are a problem on larger heaps. There is the common wisdom that if you go tens of gigabytes or terabytes of heap, you need to have a concurrent GC. This is only half-true though, because you can have a workload which has a 1 TB heap and still manages to do sub-millisecond pauses with a stop-the-world collector. For instance, if everything dies young all the time, the GC would be very, very fast.
If you have a large live data set which requires GCs to spend a while processing all of it, marking through it, moving it around, etc. then you’d prefer to have a concurrent GC.
Also consider that pause-time targets and requirements are usually expressed in human units like milliseconds, but the time you spend doing GC work depends not only on the amount of data but also on how fast your hardware is. For example, you can have some really, really slow hardware like the Raspberry Pi, which has "only" 1 GB of memory, but is also so slow that the tiny amount of GC work needed to process that gigabyte of live data can add up to a large pause. At that point, you also want a concurrent GC, because you want to minimize the time you spend, instruction-wise, in the pause.
So yes: slow hardware, or heaps with large live data sets. To our surprise, there are also users who use Shenandoah for compact heaps. One of the things about concurrent GCs, especially regionalised ones, is that you can have very simple memory-uncommit policies that do not interfere with the application. With Shenandoah, if a region has not been used for five minutes, you can uncommit the memory under it, thus shrinking the heap without doing any GC. You can do this asynchronously with the rest of the application. You will still have latency stalls when you commit that memory back, which is why you don't want to uncommit all the time.
Most of the time, before you do an uncommit you want to run a GC to compact the heap a bit. For example, if you have 100 regions and each of them has only 1% live data, then you would like to move that data out of those sparsely used regions into a single new one, so that you can uncommit the entire space under the first 100 regions.
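In builds that include Shenandoah, this idle-uncommit behaviour is tunable. A hedged sketch, not a recommendation: the one-minute value and `app.jar` are placeholders, and some releases may require unlocking experimental options first:

```shell
# Lower the idle delay before unused regions are uncommitted from the
# default (around five minutes) to one minute; the value is in milliseconds.
java -XX:+UseShenandoahGC \
     -XX:ShenandoahUncommitDelay=60000 \
     -jar app.jar
```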
This is where a stop-the-world collector fails you, because you need to accept a pause to perform that move. If you look at how people try to avoid that with stop-the-world collectors, you'll see some crazy stuff. The problem is that without predicting the future, you cannot really say whether starting a stop-the-world GC right now will interfere with some request that comes in a millisecond later, right?
So you end up with heuristic-based prediction, like looking at load averages: if the load average is low, that probably means we're in an idle phase, and this is probably a good time to do a stop-the-world GC, or even a full stop-the-world GC, hoping that nothing else comes in and gets stuck waiting for the GC to finish.
For a concurrent GC, this is less of a problem, because the application can still run while the GC is running. In fact, you can start a GC just because you feel like it. Many concurrent GC implementations I know of have this sort of periodic, or proactive, GC. They say, "we've been idle for five minutes; maybe it is time to do a GC cycle to remove garbage and make sure we have enough space for future allocations". Since it is concurrent, it costs you CPU cycles, but it doesn't cost you much latency, so you can just do it.
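Shenandoah exposes this kind of guaranteed periodic cycle as a tunable. A sketch under the assumption that your build exposes the option (availability and experimental status vary by release; `app.jar` is a placeholder):

```shell
# Ask for a GC cycle at least every five minutes (value in milliseconds),
# even if the heuristics see no allocation pressure.
java -XX:+UseShenandoahGC \
     -XX:+UnlockExperimentalVMOptions \
     -XX:ShenandoahGuaranteedGCInterval=300000 \
     -jar app.jar
```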
Richard: There is this whole other space of getting more compact heaps and getting more efficient use of memory resources that comes with concurrent GC.
Richard: That’s probably a lot more useful with things like Docker and modern deployments with containerization-type situations.
Aleksey: Yes. To our surprise, we've seen lots of interest from the folks who run dense, cloud-like environments. They care less that Shenandoah is concurrent and more that Shenandoah can uncommit memory more aggressively than other collectors. You can explain to them that one of the reasons Shenandoah can do this is that it is concurrent, but they don't care about that, right?
We even have a magic option that says "run in the most compact mode", and the rest just tunes itself to be as compact as it can possibly be.
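That "magic option" corresponds to Shenandoah's `compact` heuristics mode; for example (with `app.jar` as a placeholder application):

```shell
# Run Shenandoah tuned for the smallest footprint rather than throughput.
java -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=compact -jar app.jar
```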
Richard: So, another garbage collector in OpenJDK that seems to have a bit of excitement around it at the moment is ZGC, which seems to have very similar goals to Shenandoah as a project but was kind of open sourced much later. I don't know which actually started first. How would you compare ZGC and Shenandoah in their current states?
Aleksey: I think they're pretty similar in their goals and in their results. I mean, depending on your application, one or the other might give you a slight edge. The important thing to realize here is that Shenandoah and ZGC are an entirely different class of collectors compared to the other OpenJDK collectors. So, if you have a GC latency problem, jumping from a stop-the-world or half-concurrent collector to a fully concurrent collector is the major improvement.
The differences between those two implementations, once you've made the jump, are probably not essential enough to bother with, at least at first. So I would say it's not the technical things that win there, but logistics and organizational things.
Richard: Right. And there's also the current situation where Oracle doesn't enable Shenandoah in their downstream OpenJDK builds. Is that right?
Richard: So, which downstream OpenJDK builds are enabling Shenandoah at the moment?
Aleksey: All collectors are built by default in OpenJDK, so unless you specifically disable some, all collectors will be there. I don't know of any other vendor that specifically disables Shenandoah in their OpenJDK builds, so I would expect that if you take just about any OpenJDK build except Oracle's, Shenandoah will be there.
Richard: Right. And so you've got people using Shenandoah. Do you view it as a stable collector? I mean, are they generally happy with it as a collector, with its stability and its use in a production environment?
Aleksey: Yes, we have users who use this in production. It's kind of hard to tell because when people are happy with your collector, they don't tell you.
Richard: Yeah. Same with any product actually.
Aleksey: We know that people run things in production with Shenandoah because they come to us with the problems they face in production. Over the years, the problems have stopped being critical. Minor things, like the heuristics making a mistake, or monitoring reporting something it should not report, things like that. There are sometimes bugs, and some of them are not too scary, but still important to fix. I would say it is a pretty stable collector, and people use it in production without much of a problem.
For us, our development process is built in such a way that if you experience a problem with Shenandoah, it would probably be in the 8u or 11u build that you use in production. Since our code bases are pretty much synced across the JDK releases, the same bug probably exists in all Shenandoah versions, and we try to fix it in all the JDK releases at once.
This means there is a rather low probability that you will find yourself facing a GC bug that requires you to upgrade to the next JDK to get the fix.
Richard: Right. So, that's quite a big maintenance advantage for users that they do not have to worry about bumping their JDK versions.
Aleksey: Yeah. Remember I was saying that the choice between concurrent collectors is mostly about logistics and organizational considerations?
Aleksey: This is one of those things. Just stop and think through what happens when you find an issue in the collector, how fast would you get the fix, or if you would get the fix at all in your current release stream, and how it fits your deployment and update strategy, etc, etc.
Richard: Yeah, very interesting. What would you say are the technical advances under the hood in Shenandoah that enable it to be a concurrent GC and make it performant?
Aleksey: I usually do this pitch in my GC talks: if you think that GCs are innovative, well, think again. There is "The Garbage Collection Handbook", where most of the known GC techniques are explained. That book is, by CS standards, quite old; the latest edition that I have is from around 2006. So, Shenandoah uses the approach that Brooks suggested in the late 70s, I think in the Lisp context. It says you can move an object, but then you need to record where the new copy is, and somehow tell the application to use the right one of those two copies all the time. In other words, you have to store forwarding information somewhere.
Brooks suggested that since we control the layout of objects anyway, why don't we reserve a slot in the object to store a pointer to the most up-to-date copy of that object? Then the GC barriers, which are tiny chunks of code that execute in the application when it accesses the Java heap, arbitrate which copy you should read from or write to, and do extra work if you are in the wrong state.
For instance, if you want to write to an object that is in the collection set, condemned for relocation, then you want to copy the object first and then write to the new copy, and the read barriers just say, "Hey, if you have two copies, select the most up-to-date one."
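The Brooks scheme can be sketched as a toy model in plain Java. This is illustrative only: the class and field names are invented, everything is single-threaded, and in the real VM the forwarding slot lives in the object layout and the barriers are compiled into JIT'd code with atomic updates.

```java
// Toy model of a Brooks forwarding pointer: every object carries a slot
// that points at its most up-to-date copy (initially, itself).
final class ToyObject {
    ToyObject forwardee = this; // the forwarding pointer
    int value;

    ToyObject(int value) { this.value = value; }
}

final class ToyBarriers {
    // Read barrier: always dereference through the forwarding pointer,
    // so reads see the newest copy no matter which reference you hold.
    static ToyObject resolve(ToyObject obj) {
        return obj.forwardee;
    }

    // Write barrier with evacuation: if the object sits in the collection
    // set and has not been copied yet, copy it first, publish the
    // forwarding pointer, then write into the new copy.
    static void write(ToyObject obj, int newValue, boolean inCollectionSet) {
        ToyObject target = resolve(obj);
        if (inCollectionSet && target == obj) {
            ToyObject copy = new ToyObject(target.value);
            obj.forwardee = copy; // from now on, all accesses go to the copy
            target = copy;
        }
        target.value = newValue;
    }

    public static void main(String[] args) {
        ToyObject o = new ToyObject(1);
        write(o, 2, true);                    // evacuates, writes to the copy
        System.out.println(resolve(o).value); // prints 2
        System.out.println(resolve(o) == o);  // prints false: copy is in use
    }
}
```

Running `main` shows the key property: after an evacuating write, every access funnels through the forwarding pointer to the new copy, so the old copy can be discarded once no raw references to it remain.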
The design idea is pretty basic as far as concurrent collectors go. The differences between algorithms are usually in where you store that forwarding data and where you put the barriers that manipulate it. All these minor design and implementation details can add up to something big, and the design space is quite large: you hand-pick this and that option and then see how it works. See, for example, how gradual improvements in implementation add up, in Roman Kennke's posts on Shenandoah GC in JDK 13: Part 1: Load reference barriers, Part 2: Eliminating the forward pointer word, Part 3: Architectures and Operating Systems.
Richard: So, Brooks barriers are from the 1970s, JDKs have had GCs since the mid-90s, and Shenandoah was released only a few years ago. Why did no one else build a Brooks-barrier-based GC for OpenJDK in the intervening years?
Aleksey: Well, beats me. I don't know for sure, but I think there is a chicken-and-egg problem lurking here. GCs are simple in principle, but they still require quite a lot of engineering to produce a reliable, high-performance implementation, especially a fully concurrent one. So the cost of implementing one is non-trivial.
On the other hand, the benefits are offset by current development practice. Most GC tuning recommendations are targeted at either avoiding Full/Old GCs in generational collectors, or avoiding GC completely. For example, Real-time Java was doing all sorts of tricks to isolate itself from the GC. For "regular" Java, performance tuning fueled allocation- and heap-conscious development practices on critical paths, ultimately even going off-heap. Hardware got generally faster too, so GC cycles (especially young GCs) appear quite fast. Hardware got beefier too, so you can allocate enough heap to prolong the time until the next GC. Ultimately, combining all of these, you can even disable memory reclamation completely; see Epsilon GC, for example.
So, who would want to invest in building a concurrent GC? There is a fairly narrow need, for the reasons above. Fortunately, this balance is not static, and some JVM implementors can make the initial push through this stalemate. In principle, long-available concurrent GC implementations (honorary mention: Azul's GCs) could have resolved this, but they did not, probably because users would have had to switch away from the default/free distribution. Shenandoah and ZGC ship in default/free Java distributions, making them available to a much broader developer audience. This is how more people get much more exposure to the technology, washing away part of the original chicken-and-egg conundrum. Having development practice shift from "do crazy things to avoid GC problems" to "have a concurrent GC, and then maybe optimize for it a bit" would be a global improvement for all of us.
This concludes our chat with Aleksey. As always it’s been interesting and we’d like to thank Aleksey for taking some time out and sitting down with us to share his thoughts.