Solving Memory Leaks without Heap Dumps

Sometimes you may not want to do a heap dump. You may be running in an environment which is sensitive to latencies. Or you may be forbidden to create heap dumps, since the content will contain all your customer information and all of your organization’s account numbers, and if the dump ended up in the wrong hands, your entire business would be done for. Or you may have an 800+GB heap (yes, some customers run Java with enormous heaps with great success). And even worse, you may have a huge heap, with a relatively small ephemeral disk storage, not even able to store your huge heap dump. And, quite frankly, even if you get your 800+GB heap dump to your puny laptop, how will you open it? How much time will it take to calculate a dominator tree over that dump?

No matter the reason for not wanting to do a heap dump, there is now (well, since JDK 10, really) a JFR event that lets you solve memory leaks with very little overhead, and without having to do full heap dumps. Black magic, you say? Yes, awesome, yummy, black magic.

The Old Object Sample Event

At the heart of the red pentagram (with a black wax candle on each point and encircled with salt) is the Old Object Sample event. It was introduced in JDK 10. It basically tracks a fixed number of objects on the heap, for as long as they are live. To avoid incurring massive overhead, the objects are selected in much the same way as the allocation event samples are picked – upon retiring a TLAB, or when allocating outside of TLABs. So, a sampled subset of the allocations gets tracked.

When a sample is chosen, the allocation time gets stored together with the allocation stack trace, the thread id, the type of object being allocated, and the memory address of the object. If it’s an array, we also record the array size.

The samples are then stored in a fixed size (256 by default) combined priority queue/linked list, with weak references to the samples. If sampled objects are garbage collected, they are removed and the priority redistributed to the neighbours. The priority is called span, and is currently the size of the allocation, giving more weight to larger (therefore more severe) leaks.
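To make the bookkeeping a bit more concrete, here is a minimal Java sketch of such a queue. It is not the HotSpot implementation (that one lives in native code). The class and method names are made up for illustration, evicting the sample with the smallest span when the queue is full is an assumption, and handing the span of a dead sample to the previous neighbour only is a simplification.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Simplified, hypothetical model of the sample queue; not the HotSpot code. */
final class OldObjectSampleQueueSketch {

    /** One tracked allocation. Only a weak reference is kept, so tracking never keeps the object alive. */
    static final class Sample {
        final WeakReference<Object> object;
        final long allocationTimeNanos = System.nanoTime();
        long span; // the priority; here simply the size of the allocation

        Sample(Object object, long span) {
            this.object = new WeakReference<>(object);
            this.span = span;
        }
    }

    private final int capacity;
    // Min-heap on span: the least important sample is always at the head.
    private final PriorityQueue<Sample> bySpan =
            new PriorityQueue<>(Comparator.comparingLong((Sample s) -> s.span));
    // The same samples in allocation order, so a dead sample can find its neighbour.
    private final List<Sample> byAge = new ArrayList<>();

    OldObjectSampleQueueSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Tries to track an allocation; returns null if the allocation was rejected. */
    Sample offer(Object object, long span) {
        scavenge();
        if (byAge.size() == capacity) {
            Sample smallest = bySpan.peek();
            if (smallest.span > span) {
                return null; // everything already tracked is more important
            }
            bySpan.poll();           // assumption: evict the smallest span
            byAge.remove(smallest);
        }
        Sample sample = new Sample(object, span);
        bySpan.add(sample);
        byAge.add(sample);
        return sample;
    }

    /** Removes a sample whose object was collected and hands its span to the previous neighbour. */
    void removeDead(Sample dead) {
        int index = byAge.indexOf(dead);
        if (index < 0) {
            return;
        }
        if (index > 0) {
            Sample previous = byAge.get(index - 1);
            bySpan.remove(previous); // re-insert so the heap sees the new priority
            previous.span += dead.span;
            bySpan.add(previous);
        }
        bySpan.remove(dead);
        byAge.remove(index);
    }

    /** Drops every sample whose weakly referenced object has been garbage collected. */
    private void scavenge() {
        for (Sample sample : new ArrayList<>(byAge)) {
            if (sample.object.get() == null) {
                removeDead(sample);
            }
        }
    }

    int size() {
        return byAge.size();
    }
}
```

The point of the redistribution is that a long-lived sample slowly accumulates the weight of the short-lived ones allocated after it, which makes it less and less likely to be evicted.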

Once the recording is dumped, the paths back to the GC roots can be calculated. I write can, since this is optional – it is something that must be enabled in the recording, or as a parameter to e.g. jcmd when dumping the recording. If the reference chain is very deep (>256 object references), the reference chain will be truncated. It is also possible to specify a time budget, so that the time searching references can be limited. For example, imagine a linked list occupying most of the heap, and the sampled object being the tail of that list. The reference chain for that tail sample would span almost the entirety of the heap. With a large time budget, you would still get a truncated sample. If you don’t want to spend so much time searching the heap, you could limit the time budget.
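The effect of the time budget and the depth limit can be illustrated with a toy search. The sketch below does a breadth-first search from the roots over a small reference graph held in a map, gives up when the budget is spent, and cuts the returned chain at maxDepth entries. All names here (pathToRoot, budgetNanos, maxDepth) are hypothetical, and the search done inside the JVM is more sophisticated; this only shows the principle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a path-to-root search with a time budget; not the JVM's algorithm. */
final class PathToRootSketch {

    /**
     * Searches breadth-first from the roots until the target is found or the budget is used up.
     * Returns the chain target, referrer, ..., root (cut off after maxDepth entries),
     * or an empty list if nothing was found in time.
     */
    static List<String> pathToRoot(Map<String, List<String>> references, List<String> roots,
                                   String target, long budgetNanos, int maxDepth) {
        long deadline = System.nanoTime() + budgetNanos;
        Map<String, String> referrerOf = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String root : roots) {
            referrerOf.put(root, null); // a root has no referrer
            queue.add(root);
        }
        while (!queue.isEmpty()) {
            if (System.nanoTime() - deadline >= 0) {
                return Collections.emptyList(); // budget spent, give up
            }
            String current = queue.poll();
            if (current.equals(target)) {
                List<String> chain = new ArrayList<>();
                for (String o = current; o != null && chain.size() < maxDepth; o = referrerOf.get(o)) {
                    chain.add(o);
                }
                return chain;
            }
            for (String next : references.getOrDefault(current, Collections.emptyList())) {
                if (!referrerOf.containsKey(next)) { // first visit wins, so chains are shortest
                    referrerOf.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

A budget of zero returns nothing at all, and a small maxDepth gives a chain that starts at the sampled object but never reaches the root.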

In other words, the Old Object Sample event contains a lot of exciting information:

  • Time of allocation
  • The thread doing the allocation
  • The last known heap usage at the time of allocation
    (Which can be used to plot the live set, even if we don’t have data from the time of allocation anymore.)
  • The allocation stack trace
    (In other words, where was this object allocated?)
  • The reference chain back to the GC root at the time of dumping the recording
    (In other words, who is still holding on to this object?)
  • The address of the object

There is some additional information. You can check out IMCOldObject in the OpenJDK JMC project source for more details.
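If you would rather poke at the raw events than look at them in JMC, the jdk.jfr API in OpenJDK 11 and later can be used. The following is a rough sketch, not a recommended tool: it records a tiny, deliberate leak, dumps the recording, and prints a line per Old Object Sample event. The class name and the leak are made up, the field names are the ones I expect the event to have (allocationTime, lastKnownHeapUsage, object, type, referrer), and setting the cutoff to infinity is, as far as I know, what asking for the path to the GC roots amounts to. Since the objects are sampled, a short run like this may well produce no events at all.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedClass;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedObject;
import jdk.jfr.consumer.RecordingFile;

/** Hypothetical demo: record a small leak and list the Old Object Sample events. */
final class OldObjectSampleDump {

    // Deliberate leak for the demo: everything added here stays reachable.
    private static final List<byte[]> LEAK = new ArrayList<>();

    /** Records a short run with the event enabled and returns the dumped file. */
    static Path record() throws IOException {
        Path file = Files.createTempFile("old-object-sample", ".jfr");
        try (Recording recording = new Recording()) {
            recording.enable("jdk.OldObjectSample")
                    .withStackTrace()
                    .with("cutoff", "infinity"); // assumption: no time limit on the reference search
            recording.start();
            for (int i = 0; i < 64; i++) {
                LEAK.add(new byte[256 * 1024]);
            }
            recording.stop(); // stop first, so the samples are written before the dump
            recording.dump(file);
        }
        return file;
    }

    /** Formats one line per Old Object Sample event found in the recording. */
    static List<String> describe(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (RecordedEvent event : RecordingFile.readAllEvents(file)) {
            if (!"jdk.OldObjectSample".equals(event.getEventType().getName())) {
                continue;
            }
            RecordedObject object = event.getValue("object");
            if (object == null) {
                continue;
            }
            StringBuilder line = new StringBuilder(typeName(object))
                    .append(" allocated at ").append(event.getInstant("allocationTime"))
                    .append(", last known heap usage ").append(event.getLong("lastKnownHeapUsage"))
                    .append(" bytes");
            // Follow the referrers towards the GC root, if a chain was recorded.
            RecordedObject referrer = object.getValue("referrer");
            while (referrer != null) {
                RecordedObject holder = referrer.getValue("object");
                if (holder == null) {
                    break;
                }
                line.append(" <- ").append(typeName(holder));
                referrer = holder.getValue("referrer");
            }
            lines.add(line.toString());
        }
        return lines;
    }

    private static String typeName(RecordedObject object) {
        RecordedClass type = object.getClass("type");
        return type == null ? "<unknown>" : type.getName();
    }
}
```

Calling OldObjectSampleDump.describe(OldObjectSampleDump.record()) gives the lines to print; describe can just as well be pointed at a recording dumped from a real application.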

Here is an Old Object Sample event shown in the JMC 7 Properties view:

image

Using the Old Object Sample Event

The best way to use the Old Object Sample event is to use it in a long running application. The longer the better. Statistically speaking, you want to offer as many chances as possible for a leaked object to end up being sampled. You’d also want to be well beyond the loading of all your code. Also, you would want to have been running long enough to be sure that transients have been cleared out. For example, if you have a session time-out of some kind set to 2 hours, and a ginormous application server and even larger application taking 15 minutes to start, then the first 2 hours and 15 minutes of runtime will not be that exciting from a memory leak hunting perspective.

A simple way of using the event is to look for sampled objects that were allocated after the warm-up phase, but long enough ago that any transient objects from that time should have been cleared out by now. An even simpler rule of thumb – look at the ones allocated in the middle of the time span. 😉
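The rule of thumb is easy to express in code. Here is a small sketch, assuming you already have the allocation times of the samples as java.time.Instant values. The class name, the method name and the parameters are hypothetical; the warm-up and transient lifetime would be things like the 15 minute start-up and the 2 hour session time-out from the example above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for picking the interesting allocation times. */
final class LeakCandidateWindow {

    /**
     * Keeps the allocation times that fall after the warm-up phase, but early enough
     * that transient objects allocated at that time should be gone by the time of the dump.
     */
    static List<Instant> candidates(List<Instant> allocationTimes, Instant recordingStart,
                                    Instant dumpTime, Duration warmUp, Duration transientLifetime) {
        Instant earliest = recordingStart.plus(warmUp);
        Instant latest = dumpTime.minus(transientLifetime);
        return allocationTimes.stream()
                .filter(t -> !t.isBefore(earliest) && !t.isAfter(latest))
                .collect(Collectors.toList());
    }
}
```

If the recording is too short for the warm-up and the transient lifetime to leave any window between them, the result is simply empty, which is a hint that the application has not been running long enough.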

image

Since there is currently a bug open on JMC 7 (JMC 7 has not been released yet; we hope to fix it before we release), “picking the middle” is not yet possible. That said, in the picture above we can see that most live objects being tracked are actually held on to by the Leak$DemoThread, which has a Hashtable (what can I say, it’s a really old example program), having an entry array, containing an entry holding on to a Leak$DemoObject which in turn holds on to a leaked char[].

Now, JMC has a more sophisticated algorithm for selecting good candidates than “go for the ones in the middle”. It first checks if we have an increasing live set. If so, and if we have Old Object Sample information, we will try to find good candidates using a combination of the distance from the root, the ratio of how many objects this candidate keeps alive to how many objects its root keeps alive, and the ratio of how many objects the candidate keeps alive to how many objects are alive globally. For more information, check out the ReferenceTreeModel in the JMC project.

This has already become a much longer post than I was planning on. Anyways, if you want to experiment a bit with the Old Object Sample event, I have an upcoming JMC and JFR Tutorial that I am planning on “releasing” when JMC 7 is out. That said, you can already beta test it. There is some more information in the blog entry prior to this one.

The Practical Guide to the Old Object Sample Event

If you use the continuous template, this is recorded:

  • Timestamp
  • Thread
  • Object Type

If you use the profile template, this is recorded:

  • Timestamp
  • Thread
  • Object Type
  • Allocation stack trace

If you ask for path-to-gc-roots you also get the reference chains. This can be done by:

  • Adding it as a parameter on the command line:
    -XX:StartFlightRecording=path-to-gc-roots=true
  • By asking for it when dumping the flight recorder, for example using jcmd:
    jcmd <pid> JFR.dump path-to-gc-roots=true

You can also configure the number of objects to track by setting the old-object-queue-size in the flight recording options, for example:

-XX:FlightRecordingOptions=old-object-queue-size=256

If you want to configure the cutoff for how long to search for references, you can do that in the template file. For example, these are the default settings in the profile template (JDK_HOME/lib/jfr/profile.jfc):

    <event name="jdk.OldObjectSample">
      <setting name="enabled" control="memory-leak-detection-enabled">true</setting>
      <setting name="stackTrace" control="memory-leak-detection-stack-trace">true</setting>
      <setting name="cutoff" control="memory-leak-detection-cutoff">0 ns</setting>
    </event>

Summary

  • The Old Object Sample event is awesome
  • It can, among other things, be used to hunt down memory leaks without doing hprof heap dumps
  • It will also bring you luck and good fortune, not to mention that it smells good

Sneak Peek of JDK Mission Control 7 Tutorial

Even though JMC 7 is not GA yet, I thought I’d make the upcoming JMC Tutorial available on my GitHub. Hopefully this will be a good resource to help to learn more about using Mission Control 7 and Flight Recorder in OpenJDK 11.

It does take a bit of preparation to run it for now:

  • JDK Mission Control will need to be built from source, since there are no update sites available yet
  • JOverflow will not work until JMC-6121 is solved
  • Exercise 5 will be better once JMC-6127 is solved

That said, all the preparations needed are listed in the README.md file in the GitHub repo:

https://github.com/thegreystone/jmc-tutorial

Please let me know if something is missing from the instructions!

My Sessions at Code One 2018

If anyone would like to catch up with me at Code One, here are some specific times where my location is known in advance. 😉

| Session Title | ID | Date | Start Time | End Time | Room |
|---|---|---|---|---|---|
| Contributing to the Mission Control OpenJDK Project | [DEV4506] | Monday, Oct 22 | 10:30 | 11:15 | Moscone West Room 2004 |
| Robotics on Java Simplified | [DEV6089] | Monday, Oct 22 | 14:30 | 15:15 | Moscone West Room 2024 |
| Production-Time Profiling and Diagnostics on the JVM | [DEV4507] | Wednesday, Oct 24 | 10:30 | 11:15 | Moscone West Room 2004 |
| OpenJDK Mission Control: The Hands-on-Lab | [HOL4508] | Wednesday, Oct 24 | 12:30 | 14:30 | Moscone West Room 2001A |
| Diagnose Your Microservices: OpenTracing/Oracle Application Performance Monitoring Cloud | [DEV5435] | Wednesday, Oct 24 | 16:00 | 16:45 | Moscone West Room 2011 |
| Getting Started with the (Open Source) JDK Mission Control | [DEV4509] | Thursday, Oct 25 | 11:00 | 11:45 | Moscone West Room 2014 |

Note that for the last few years the HoL has been full, so it may be a good idea to register for it early, especially now that JMC/JFR is being open sourced (JDK 11, JMC 7).

Looking forward to seeing you at Code One!

Here is a link to the sessions in the content catalogue.