Solving Memory Leaks without Heap Dumps

Sometimes you may not want to do a heap dump. You may be running in an environment which is sensitive to latencies. Or you may be forbidden to create heap dumps, since the content will contain all your customer information and all of your organization’s account numbers, and if the dump ended up in the wrong hands, your entire business would be done for. Or you may have an 800+GB heap (yes, some customers run Java with enormous heaps with great success). And even worse, you may have a huge heap, with a relatively small ephemeral disk storage, not even able to store your huge heap dump. And, quite frankly, even if you get your 800+GB heap dump to your puny laptop, how will you open it? How much time will it take to calculate a dominator tree over that dump?

No matter the reason for you not wanting to do a heap dump, there is now (well, since JDK 10 really) a new JFR event that lets you solve memory leaks with very little overhead, and without having to do full heap dumps. Black magic you say? Yes, awesome, yummy, black magic.

The Old Object Sample Event

At the heart of the red pentagram (with a black wax candle on each point and encircled with salt) is the Old Object Sample event. It was introduced in JDK 10. It basically tracks a fixed number of objects on the heap, for as long as they are live. To not incur massive overhead, the objects are selected in much the same way as the allocation event samples are picked: upon retiring a TLAB, or when allocating outside of TLABs. So, a sampled subset of the allocations gets tracked.

When a sample is chosen, the allocation time gets stored together with the allocation stack trace, the thread id, the type of object being allocated, and the memory address of the object. If it’s an array, we also record the array size.

The samples are then stored in a fixed size (256 by default) combined priority queue/linked list, with weak references to the samples. If sampled objects are garbage collected, they are removed and the priority redistributed to the neighbours. The priority is called span, and is currently the size of the allocation, giving more weight to larger (therefore more severe) leaks.
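To make the bookkeeping a bit more concrete, here is a minimal sketch of such a queue in Java. It is a toy model under stated assumptions, not the HotSpot implementation: the class name SampleQueueSketch, the rule that a full queue only admits a sample with a larger span than its current lowest one, and the choice of which neighbour inherits a removed sample's span are all mine, made for illustration.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

/** Toy model of a bounded old-object sample queue. Not the HotSpot implementation. */
final class SampleQueueSketch {

    /** One tracked allocation. The weak reference means tracking never keeps the object alive. */
    private static final class Sample {
        final WeakReference<Object> object;
        long span; // priority; modelled here as the allocation size in bytes

        Sample(Object object, long span) {
            this.object = new WeakReference<>(object);
            this.span = span;
        }
    }

    private final int capacity;
    private final List<Sample> samples = new ArrayList<>(); // oldest sample first

    public SampleQueueSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Offers a newly sampled allocation. Returns false if the sample was rejected. */
    public boolean offer(Object object, long span) {
        purgeCollected();
        if (samples.size() >= capacity) {
            int lowest = indexOfLowestSpan();
            if (lowest < 0 || samples.get(lowest).span >= span) {
                return false; // assumption: a full queue only admits higher-priority samples
            }
            removeAndRedistribute(lowest);
        }
        samples.add(new Sample(object, span));
        return true;
    }

    /** Drops samples whose objects have been garbage collected. */
    public void purgeCollected() {
        for (int i = samples.size() - 1; i >= 0; i--) {
            if (samples.get(i).object.get() == null) {
                removeAndRedistribute(i);
            }
        }
    }

    /** Removes the sample at index i and hands its span to a neighbour, if one exists. */
    private void removeAndRedistribute(int i) {
        Sample removed = samples.remove(i);
        if (samples.isEmpty()) {
            return;
        }
        // Assumption: prefer the next (younger) sample, fall back to the previous one.
        int neighbour = Math.min(i, samples.size() - 1);
        samples.get(neighbour).span += removed.span;
    }

    private int indexOfLowestSpan() {
        int lowest = -1;
        for (int i = 0; i < samples.size(); i++) {
            if (lowest < 0 || samples.get(i).span < samples.get(lowest).span) {
                lowest = i;
            }
        }
        return lowest;
    }

    public int size() {
        return samples.size();
    }

    public long totalSpan() {
        long total = 0;
        for (Sample s : samples) {
            total += s.span;
        }
        return total;
    }
}
```

The point of redistributing the span is that the total weight in the queue is preserved when a sample goes away, so a long-lived survivor gradually accumulates the priority of the short-lived objects that were sampled around it.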

Once the recording is dumped, the paths back to the GC roots can be calculated. I write can, since this is optional – it is something that must be enabled in the recording, or as a parameter to e.g. jcmd when dumping the recording. If the reference chain is very deep (>256 object references), the reference chain will be truncated. It is also possible to specify a time budget, so that the time searching references can be limited. For example, imagine a linked list occupying most of the heap, and the sampled object being the tail of that list. The reference chain for that tail sample would span almost the entirety of the heap. With a large time budget, you would still get a truncated sample. If you don’t want to spend so much time searching the heap, you could limit the time budget.
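The interplay between the depth limit and the time budget can be sketched on a toy object graph. The following is only an illustration under my own assumptions, not how the JVM does it: objects are plain strings, references are a map from referrer to referees, the search is a simple breadth-first walk from the roots, and a truncated chain keeps the end closest to the sample. The names PathToRootSketch and pathToRoot are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy reference-chain search with a depth limit and a time budget. */
final class PathToRootSketch {

    /**
     * Searches breadth-first from the roots until the sample is found or the
     * time budget runs out. Returns the chain from the sample back towards its
     * root, truncated to maxDepth entries, or an empty list if nothing was found.
     */
    public static List<String> pathToRoot(Map<String, List<String>> references,
                                          Collection<String> roots,
                                          String sample,
                                          int maxDepth,
                                          long budgetNanos) {
        long deadline = System.nanoTime() + budgetNanos;
        Map<String, String> referrer = new HashMap<>(); // object -> who references it
        Deque<String> queue = new ArrayDeque<>();
        for (String root : roots) {
            if (!referrer.containsKey(root)) {
                referrer.put(root, null); // roots have no referrer
                queue.add(root);
            }
        }
        boolean found = referrer.containsKey(sample);
        while (!found && !queue.isEmpty()) {
            if (System.nanoTime() - deadline >= 0) {
                return List.of(); // time budget exhausted, give up
            }
            String current = queue.poll();
            for (String target : references.getOrDefault(current, List.of())) {
                if (referrer.containsKey(target)) {
                    continue; // already visited
                }
                referrer.put(target, current);
                if (target.equals(sample)) {
                    found = true;
                    break;
                }
                queue.add(target);
            }
        }
        if (!found) {
            return List.of();
        }
        // Walk back from the sample, stopping at the root or at the depth limit.
        List<String> chain = new ArrayList<>();
        for (String o = sample; o != null && chain.size() < maxDepth; o = referrer.get(o)) {
            chain.add(o);
        }
        return chain;
    }
}
```

With a linked list of 1000 nodes hanging off a single root, asking for the tail with a depth limit of 256 returns only the 256 entries nearest the tail, no matter how generous the time budget is. With a budget of zero, the search gives up before following a single reference.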

In other words, the Old Object Sample event contains a lot of exciting information:

  • Time of allocation
  • The thread doing the allocation
  • The last known heap usage at the time of allocation
    (Which can be used to plot the live set, even if we don’t have data from the time of allocation anymore.)
  • The allocation stack trace
    (In other words, where was this object allocated?)
  • The reference chain back to the GC root at the time of dumping the recording
    (In other words, who is still holding on to this object?)
  • The address of the object

There is some additional information. You can check out IMCOldObject in the OpenJDK JMC project source for more details.

Here is an Old Object Sample event shown in the JMC 7 Properties view:

[Image: an Old Object Sample event shown in the JMC 7 Properties view]

Using the Old Object Sample Event

The best way to use the Old Object Sample event is to use it in a long running application. The longer the better. Statistically speaking, you want to offer as many chances as possible for a leaked object to end up being sampled. You’d also want to be well beyond the loading of all your code. Also, you would want to have been running long enough to be sure that transients have been cleared out. For example, if you have a session time-out of some kind set to 2 hours, and a ginormous application server and even larger application taking 15 minutes to start, then the first 2 hours and 15 minutes of runtime will not be that exciting from a memory leak hunting perspective.

A simple way of using the event is to look for samples allocated after the warmup phase, but long enough ago that any transient objects should reasonably have been cleared out by now. An even simpler rule of thumb: look at the ones allocated in the middle of the time span. ;)

[Image: JMC showing the tracked live objects and their reference chains back to Leak$DemoThread]

Since there is currently a bug open on JMC 7 (JMC 7 has not been released yet; we hope to fix it before we release), “picking the middle” is not yet possible. That said, in the picture above we can see that most live objects being tracked are actually held on to by the Leak$DemoThread, which has a Hashtable (what can I say, it’s a really old example program), having an entry array, containing an entry holding on to a Leak$DemoObject which in turn holds on to a leaked char[].
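Outside of JMC, the “pick the middle” rule of thumb can be approximated by parsing the recording with the jdk.jfr.consumer API that ships with the JDK. The sketch below makes a few assumptions of mine: “the middle” is taken to be the middle third of the span between the earliest and latest sampled allocation, the class and method names are hypothetical, and the field names allocationTime and object.type are the ones I have seen on jdk.OldObjectSample in recent JDKs, which is why they are guarded with hasField.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import jdk.jfr.consumer.RecordedClass;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedObject;
import jdk.jfr.consumer.RecordingFile;

/** Lists the Old Object Sample events allocated in the middle of the sampled time span. */
final class MiddleSamples {

    /** True if t lies in the middle third of [start, end], boundaries included. */
    public static boolean inMiddleThird(Instant t, Instant start, Instant end) {
        Duration third = Duration.between(start, end).dividedBy(3);
        return !t.isBefore(start.plus(third)) && !t.isAfter(end.minus(third));
    }

    /** Returns "allocation time + type name" for every sample in the middle third. */
    public static List<String> describeMiddleSamples(Path recording) throws IOException {
        List<Instant> times = new ArrayList<>();
        List<String> types = new ArrayList<>();
        try (RecordingFile file = new RecordingFile(recording)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                if (!"jdk.OldObjectSample".equals(event.getEventType().getName())
                        || !event.hasField("allocationTime")) {
                    continue;
                }
                RecordedObject object = null;
                if (event.hasField("object")) {
                    object = event.getValue("object");
                }
                RecordedClass type = null;
                if (object != null && object.hasField("type")) {
                    type = object.getClass("type");
                }
                times.add(event.getInstant("allocationTime"));
                types.add(type != null ? type.getName() : "<unknown>");
            }
        }
        if (times.isEmpty()) {
            return List.of();
        }
        Instant start = Collections.min(times);
        Instant end = Collections.max(times);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < times.size(); i++) {
            if (inMiddleThird(times.get(i), start, end)) {
                result.add(times.get(i) + " " + types.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MiddleSamples.java <recording.jfr>");
            return;
        }
        for (String line : describeMiddleSamples(Path.of(args[0]))) {
            System.out.println(line);
        }
    }
}
```

This only narrows down which samples to look at first. The allocation stack trace and, if recorded, the reference chain for each of them are still the interesting parts, and those are easier to explore in JMC.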

Now, JMC has a more sophisticated algorithm for selecting good candidates than “go for the ones in the middle”. It first checks if we have an increasing live set. If so, and if we have Old Object Sample information, we will try to find good candidates using a combination of the distance from the root, the ratio of how many objects this candidate keeps alive to how many objects its root keeps alive, and the ratio of how many objects the candidate keeps alive to how many objects are alive globally. For more information, check out the ReferenceTreeModel in the JMC project.

This has already become a much longer post than I was planning on. Anyways, if you want to experiment a bit with the Old Object Sample event, I have an upcoming JMC and JFR Tutorial that I am planning on “releasing” when JMC 7 is out. That said, you can already beta test it. There is some more information in the blog entry prior to this one.

The Practical Guide to the Old Object Sample Event

If you use the continuous template, this is recorded:

  • Timestamp
  • Thread
  • Object Type

If you use the profile template, this is recorded:

  • Timestamp
  • Thread
  • Object Type
  • Allocation stack trace

If you ask for path-to-gc-roots you also get the reference chains. This can be done by:

  • Adding it as a parameter on the command line:
    -XX:StartFlightRecording=path-to-gc-roots=true
  • Asking for it when dumping the flight recording, for example using jcmd:
    jcmd <pid> JFR.dump path-to-gc-roots=true

You can also configure the number of objects to track by setting the old-object-queue-size in the flight recorder options, for example:

-XX:FlightRecorderOptions=old-object-queue-size=256

If you want to configure the cutoff for how long to search for references, that can be done in the template file. For example, these are the default settings in the profile template (JDK_HOME/lib/jfr/profile.jfc):

    <event name="jdk.OldObjectSample">
      <setting name="enabled" control="memory-leak-detection-enabled">true</setting>
      <setting name="stackTrace" control="memory-leak-detection-stack-trace">true</setting>
      <setting name="cutoff" control="memory-leak-detection-cutoff">0 ns</setting>
    </event>

Summary

  • The Old Object Sample event is awesome
  • It can, among other things, be used to hunt down memory leaks without doing hprof heap dumps
  • It will also bring you luck, good fortune, not to mention smells good

5 Responses to "Solving Memory Leaks without Heap Dumps"

  1. I wonder what the memory-leak-detection-cutoff value really does?

    Its default value appears to be 1 hour. Is it?

    Why do I ask this?

    We do (well, our app does) suffer from rather slow memory leaks (say day after day). And it has a heap size which makes it rather impractical to use heap dumps.

  2. Marcus says:

    The setting defines the time budget to allow for searching reference chains to the GC roots for the leak candidates.
    The default is off, but if used with JMC, there is an “Object Types + Allocation Stack Traces + Path to GC Root” option
    you can select in the UI, defined in the template, which indeed sets it to an hour.

    <selection name="memory-leak-detection" default="minimal" label="Memory Leak Detection">
      <option label="Off" name="off">off</option>
      <option label="Object Types" name="minimal">minimal</option>
      <option label="Object Types + Allocation Stack Traces" name="medium">medium</option>
      <option label="Object Types + Allocation Stack Traces + Path to GC Root" name="full">full</option>
    </selection>

    <condition name="memory-leak-detection-cutoff" true="1 h" false="0 ns">
      <test name="memory-leak-detection" operator="equal" value="full"/>
    </condition>

    An hour might be an excessive amount of time to spend walking the heap chasing references with the application halted, so you can of course set it to whatever you want.

  3. Thank you very much @Marcus!

    That makes sense indeed and explains JFROldObject appearing under VM Operations Peak Duration at the very end of a dump (2 x 11 seconds in my case, which isn’t too bad at all).

    It appears to take the walk twice though, if the recording was both started with “Object Types + Allocation Stack Traces + Path to GC Root” and dumped with “path-to-gc-roots=true”.

    BTW: Were there any reports of Flight Recorder causing SIGSEGVs (either within G1GC or Flight Recorder itself using Zulu JDK8u275)? We can’t get Flight Recorder to run for longer than a couple of hours during business hours before the VM crashes – suspiciously enough immediately after having completed a GC caused by a Humongous Allocation.

  4. Marcus says:

    Walking twice seems funny – are you sure it’s not the case that you have a chunk rotation happening, and configured to dump once per chunk? The problem with Zulu is something you should report to Azul Systems, they would likely be more than happy to help you analyze any dumps.

  5. Vivian says:

    Hi Marcus, is this just a part of all Old Obj? Is there any way to view all Old Obj?
